-
Notifications
You must be signed in to change notification settings - Fork 152
DRAFT: Create crashdumps for active sandboxes if process is crashing and creating a crashdump #992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
429ee4f to
0666756
Compare
Signed-off-by: Simon Davies <[email protected]>
0666756 to
1852abf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good and clean from my point of view. Great work!
Some tests would be awesome! To know this works as expected with every change.
| let mut sa: sigaction = std::mem::zeroed(); | ||
| sa.sa_sigaction = crash_signal_handler as usize; | ||
| sa.sa_flags = SA_SIGINFO | SA_RESTART; | ||
| libc::sigemptyset(&mut sa.sa_mask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how this works, but is setting the signal handler before retrieving the previous the way it works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes when you set the handler you get back any previous handler which you can then chain to
|
|
||
| // Register with crash handler if dumps are enabled | ||
| #[cfg(feature = "crashdump")] | ||
| if vm.runtime_config().guest_core_dump |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want this on by default?
Imagine running a huge number of sandboxes and something hangs, and you want to terminate it and it starts dumping everything.
I know this is not a production use case, but still. Let me know what you think.
Also, I am curious what happens if we're running 100+ sandboxes of big sizes 500MiB 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think that we should have a configuration option for this which turns it on or off globally and set it to false by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for now we can leave it as is and introduce a configuration option if necessary.
This pull request introduces a new crash handler capability for Hyperlight, enabling automatic sandbox dump generation when the host crashes due to fatal signals (Linux) or exceptions (Windows). The system uses a global, lock-free registry for tracking sandboxes and installs platform-specific crash handlers. The implementation is gated behind a feature flag and is robust against initialization failures and recursive crashes.
Crash handler infrastructure:
crash_handler.rsmodule that manages sandbox registration and crash dump generation, using a global lock-free registry (DashMap) and lazy initialization withonce_cell. It ensures safe (for crash context) access to hypervisor pointers and robust initialization.once_cellanddashmapinCargo.tomlto support lock-free global state and lazy initialization for the crash handler subsystem.Linux-specific crash handling:
crash_handler/linux.rsmodule that installs signal handlers for fatal signals (e.g., SIGSEGV, SIGABRT) and chains to previous handlers after generating sandbox dumps. It includes robust detection for whether core dumps are enabled on the system and ensures async-signal-safety is intentionally violated only during crash handling.Safety and error handling:
Closes #966