-
Notifications
You must be signed in to change notification settings - Fork 6.2k
JDK-8252533: Signal handlers should run with synchronous error signals unblocked #839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into |
Webrevs
|
|
Hi Thomas, I think this has highlighted a number of bugs with our signal handling code in general. It makes no sense to ever block these synchronous signals, so any code that does so seems flawed. You also point out the semantics of sa_sigaction.sa_mask which highlights that call_chained_handler is incorrect as it tries to emulate the conditions under which the original handler would have been called, but it explicitly sets a sigmask from sa_mask instead of combining it with the existing sigmask. Aside: I also noticed the SR_sigset is unused, so unclear if we have a bug lurking in PosixSignals::SR_initialize. There is some commentary about the blocked/unblocked status of SR_signum that also seems wrong. That all said ... AFAICS the synchronous signals are all explicitly unblocked via PosixSignals::hotspot_sigmask which is set when a new thread starts executing, or an existing thread attaches. So from that point we should only need to ensure those signals are not explicitly blocked - which means that in os::signal we unblock them in sa_mask for all platforms. Does that not suffice? Or are you concerned about user native code changing the signal masks arbitrarily? In which case re-setting in each of the VM's signal handlers would be reasonable. Note that SIGTRAP is currently mostly only processed under #if defined(PPC64)and/or defined(AIX). Thanks, |
Yes, this coding may benefit from a bit more grooming (but preferably in another RFE).
You are right. This is what I tried to say with "sa_sigaction.sa_mask vs pthread_sigmask". I chose explicit unblocking over setting the signal mask when establishing the handler because I was concerned about someone changing the thread signal mask after handler is installed and before the signal is triggered. Especially I am not sure what happens if some native code calls sigprocmask() - does that affect all threads, only this thread? Posix says its undefined. Because I did not want to think about all this, and because I found it slightly more obvious, I unblock explicitly. That said, it was a bit of a coin toss, and if you dislike this, we can do it via sigaction.sa_mask instead. The danger of someone blocking synchronous error signals under our nose is probably remote :)
Yes, we use it for implicit null checks. But this has only meaning outside of signal handling. I unblocked it inside signal handlers because I found vague reports of processes crashing with blocked SIGTRAPs and thought that the same reasoning may apply to this signal wrt to blocking in signal handlers. Blocking SIGTRAP has been handled inconsistently, too. It had been unblocked at the entrance of the secondary error handler for all platforms (see vmError_posix.cpp) but not in general error handling. I think that is an oversight.
Thanks, Thomas |
|
hi Thomas, Looks good as far as I can tell. I'm still learning the POSIX signal code, so excuse the question, but if our signal handlers are capable of handling the synchronous signals while doing fatal error handling, then why don't we unblock it for all the signals, so that we can handle any kind of a situation inside the fatal error handling? Also, since we're touching the "VMError::reset_signal_handlers()" any chance we can improve on the name here, as you yourself suggested in other review? cheers |
Thank you!
Its unnecessary and safer. When signals which can be deferred (lets say, SIGPIPE or SIGCHILD) happen while they are blocked in the receiving thread's signal mask, nothing really happens: they are just added to an internal kernel queue for the time being, until the receiving thread unblocks, eg. by returning from signal handling. Were they unblocked and happen to be delivered to the thread during fatal error handling - so, the reporting thread is somewhere inside VMError::report(), working on one of the error reporting steps - the error reporting step gets interrupted and we execute (in a new call frame): crash_handler(SIGPIPE) -> Now call stack looks like this: When re-entering VMError::report(), we skip the error reporting step during which we got the second signal, because we assume it was an error signal. We just continue with the next step. The whole thing is quite ingenious in that it allows us to do all kind of unsafe things for the sake of error reporting. If one step fails, it does not endanger the next step. Also, we never unwind the stack, so when we dump the core after error handling, we have the full stack (plus 0-n incarnations of above frames). Unwinding the stack may also be dangerous since the stack may be corrupted. The only caveat is that we artificially limit the number of recursive error signals that can happen to 30 (not sure why) and of course we may run out of stack after too many recursive errors. Bottomline, that whole mechanism is written with synchronous error signals in mind, and we would have to rethink it if we allow all signals (eg it would not be necessary to skip an error reporting step because we get a SIGPIPE). Since blocking works fine for those signals, there is no need to do this.
Sure, but I'd like to do this in a separate patch. I was actually counting on you in that other patch :) but I have some vague cleanup in mind at some point, unless you are faster. Cheers, Thomas
|
dholmes-ora
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Thomas,
I still feel that we should never explicitly block these error signals given it leads to undefined behaviour. So I'd like to see a change to os::signal at least in a future RFE.
For now I think all the signal handlers should explicitly unblock the error signals, which means the SR_handler also needs updating.
Some other minor changes requested below.
Thanks,
David
| int rc = ::pthread_sigmask(SIG_UNBLOCK, &set, NULL); | ||
| #ifdef ASSERT | ||
| if (rc != 0) { | ||
| log_warning(os)("pthread_sigmask failed (%d)", errno); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pthread_sigmask doesn't set errno, it returns the error code directly. But as there is only one defined error code for pthread_sigmask (EINVAL) this warning seem unnecessary. The usual assert_status would suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
I thought about assert too. assert is not a good idea, since we would trigger our assertion poison page, and since segv may still be blocked, this would cause exactly the problem I want to prevent.
But since the only error is EINVAL, and we know what we feed into this function, I'll just remove the assert.
Not a problem, that is reasonable. No need to make a new RFE. I will prepare a new version which excludes those signals from the sigaction sa_mask too, but leaves the unblocking in place. And I'll add the SR_handler. Since this will probably take some days to get tested again, I'll be happy if you review it after your vacation :) Cheers, Thomas
|
|
Mailing list message from David Holmes on hotspot-runtime-dev: Hi Thomas, On 29/10/2020 4:48 pm, Thomas Stuefe wrote:
Okay.
Right - if the only expected values for sig were in fact one of those Cheers, |
.....
I find the code in SR_initialize strange and very probably broken. I created https://bugs.openjdk.java.net/browse/JDK-8255577 to keep track of this. Its on my name but if you'd like to take this on be my guest. |
|
Changed in the new version:
|
|
Tests ran in our nightlies on all platforms successfully, including ppc, s390, aarch64. |
dholmes-ora
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Thomas,
A couple of minor nits but otherwise seems good to go.
We should share these error signals between signals_posix and vmError_posix at some stage.
Thanks,
David
| // absolutely sure they won't happen. In practice, this is always. | ||
| // | ||
| // Relevant Posix quote: | ||
| // <quote> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line and </quote> are not needed
| // </quote> | ||
| // | ||
| // We also include SIGTRAP in that list of never-to-block-signals. While not | ||
| // mentioned by the Posix documentation, in our (SAPs) experience blocking it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to be an extra space after // on this and next couple of line.
|
@tstuefe This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 174 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
I agree.
Thanks a lot, David! I'll fix those last nits and then I'll push. |
|
/integrate |
|
@tstuefe Since your change was applied there have been 174 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit e7a2d5c. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
…blocked Reviewed-by: gziemski, dholmes
Hi,
may I have reviews please for the following patch which takes care to unblock synchronous error signals at the entrance of our main signal handler (including some minor code cleanups).
When a signal happens which cannot be deferred (SIGFPE, SIGILL, SIGSEGV, SIGBUS) but whose delivery is blocked, bad things happen. This is undefined territory, and we have observed the following cases:
I did not test other platforms but would guess similar things happen there. Posix documentation [4] states:
"If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are generated while they are blocked, the result is undefined, unless the signal was generated by the kill() function, the sigqueue() function, or the raise() function."
At the moment, undeferrable error signals are unblocked outside the signal handler (see hotspot sigmask) and, since JDK-8065895, inside the error handler (see crash_handler setup). This leaves us with a window where the hotspot signal handler runs but before he has decided to invoke fatal error handling. Inside that window, for any platform but AIX error signals are still blocked. So any crash inside them tears down the VM immediately without giving us a useful hs-err file.
On AIX they are not blocked because we added an AIX-only patch a while ago which unblocks them at the entrance of the AIX signal handler. This was before we contributed the port to OpenJDK, so no history in the official repos. But that behavior makes sense for all posix platforms.
For more details see discussion from Nov 2014 [2][3].
(Side note, these effects only show for truly synchronous error signals. You cannot artificially create such a scenario e.g. by raising SIGSEGV with kill.)
About sa_sigaction.sa_mask vs pthread_sigmask: At the first glance, it would be more elegant to just unblock error signals in the signal mask specified when installing the signal handler. However, the way that mask works is that it is ANDed together with the process signal mask in effect when the signal was raised. So if one or more of the error signals were blocked at that point, they would continue to be blocked in the handler, sa_mask notwithstanding.
It seems easier and clearer to just manually unblock those signals at the entrance of our handlers.
Changes in detail:
Tests:
I tested the change manually by triggering a segfault inside javaSignalHandler. That reliably tore down the process immediately. With this patch, once error signals are unblocked, we continue to run - we get the secondary crash, which then leads to VMError::report_and_die() called since its a real error (I prevented recursion for this test. See my comments about recursion in the JBS issue.)
We will put this through our SAP internal tests before pushing.
[1] https://bugs.openjdk.java.net/browse/JDK-8065895
[2] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2014-November/013346.html
[3] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2015-January/013718.html
[4] https://pubs.opengroup.org/onlinepubs/009695399/functions/sigprocmask.html
Progress
Testing
Issue
Reviewers
Download
$ git fetch https://git.openjdk.java.net/jdk pull/839/head:pull/839$ git checkout pull/839