JDK-8252533: Signal handlers should run with synchronous error signals unblocked #839

tstuefe · 2020-10-23T15:44:26Z

Hi,

may I have reviews please for the following patch which takes care to unblock synchronous error signals at the entrance of our main signal handler (including some minor code cleanups).

When a signal happens which cannot be deferred (SIGFPE, SIGILL, SIGSEGV, SIGBUS) but whose delivery is blocked, bad things happen. This is undefined territory, and we have observed the following cases:

on Linux, the default handler is invoked instead of the user handler, which in case of error signals causes the process to core immediately. This is actually one of the better outcomes, we still have a core.
on AIX and PASE both, the process just becomes unresponsive and hangs.
on HPUX - one of our internal platform - the process just vanishes without a trace.

I did not test other platforms but would guess similar things happen there. Posix documentation [4] states:
"If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are generated while they are blocked, the result is undefined, unless the signal was generated by the kill() function, the sigqueue() function, or the raise() function."

At the moment, undeferrable error signals are unblocked outside the signal handler (see hotspot sigmask) and, since JDK-8065895, inside the error handler (see crash_handler setup). This leaves us with a window where the hotspot signal handler runs but before he has decided to invoke fatal error handling. Inside that window, for any platform but AIX error signals are still blocked. So any crash inside them tears down the VM immediately without giving us a useful hs-err file.

On AIX they are not blocked because we added an AIX-only patch a while ago which unblocks them at the entrance of the AIX signal handler. This was before we contributed the port to OpenJDK, so no history in the official repos. But that behavior makes sense for all posix platforms.

For more details see discussion from Nov 2014 [2][3].

(Side note, these effects only show for truly synchronous error signals. You cannot artificially create such a scenario e.g. by raising SIGSEGV with kill.)

About sa_sigaction.sa_mask vs pthread_sigmask: At the first glance, it would be more elegant to just unblock error signals in the signal mask specified when installing the signal handler. However, the way that mask works is that it is ANDed together with the process signal mask in effect when the signal was raised. So if one or more of the error signals were blocked at that point, they would continue to be blocked in the handler, sa_mask notwithstanding.
It seems easier and clearer to just manually unblock those signals at the entrance of our handlers.

Changes in detail:

At the entrance of javaSignalHandler() we now unblock error signals for all platforms, not just for AIX.
Nothing changes at the entrance of the secondary error handler (crash_handler()): just better code reuse.
In os::signal(), for AIX only, we used to set the sa_mask for the handler-to-be-installed to unblock error signals. I decided to just remove this section. The reason is: I want to get rid of this AIX specific section, and was torn between leaving it in for all platforms or removing it. os::signal is used to install other signal handlers too, and I do not want to change their behavior with this patch. Therefore I just removed this section.

Tests:

I tested the change manually by triggering a segfault inside javaSignalHandler. That reliably tore down the process immediately. With this patch, once error signals are unblocked, we continue to run - we get the secondary crash, which then leads to VMError::report_and_die() called since its a real error (I prevented recursion for this test. See my comments about recursion in the JBS issue.)

We will put this through our SAP internal tests before pushing.

[1] https://bugs.openjdk.java.net/browse/JDK-8065895
[2] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2014-November/013346.html
[3] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2015-January/013718.html
[4] https://pubs.opengroup.org/onlinepubs/009695399/functions/sigprocmask.html

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed

Testing

	Linux x32	Linux x64	Windows x64	macOS x64
Build	⏳ (1/1 running)	⏳ (5/5 running)	⏳ (2/2 running)	⏳ (2/2 running)

Issue

JDK-8252533: Signal handlers should run with synchronous error signals unblocked

Reviewers

Gerard Ziemski (@gerard-ziemski - Committer) ⚠️ Review applies to eabe3e3
David Holmes (@dholmes-ora - Reviewer) ⚠️ Review applies to 8d86e25

Download

$ git fetch https://git.openjdk.java.net/jdk pull/839/head:pull/839
$ git checkout pull/839

bridgekeeper · 2020-10-23T15:44:39Z

👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2020-10-23T15:45:47Z

@tstuefe The following label will be automatically applied to this pull request:

hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2020-10-23T19:30:42Z

Webrevs

dholmes-ora · 2020-10-26T02:23:56Z

Hi Thomas,

I think this has highlighted a number of bugs with our signal handling code in general. It makes no sense to ever block these synchronous signals, so any code that does so seems flawed. You also point out the semantics of sa_sigaction.sa_mask which highlights that call_chained_handler is incorrect as it tries to emulate the conditions under which the original handler would have been called, but it explicitly sets a sigmask from sa_mask instead of combining it with the existing sigmask.

Aside: I also noticed the SR_sigset is unused, so unclear if we have a bug lurking in PosixSignals::SR_initialize. There is some commentary about the blocked/unblocked status of SR_signum that also seems wrong.

That all said ... AFAICS the synchronous signals are all explicitly unblocked via PosixSignals::hotspot_sigmask which is set when a new thread starts executing, or an existing thread attaches. So from that point we should only need to ensure those signals are not explicitly blocked - which means that in os::signal we unblock them in sa_mask for all platforms. Does that not suffice? Or are you concerned about user native code changing the signal masks arbitrarily? In which case re-setting in each of the VM's signal handlers would be reasonable.

Note that SIGTRAP is currently mostly only processed under #if defined(PPC64)and/or defined(AIX).

Thanks,
David

tstuefe · 2020-10-26T11:43:54Z

Hi Thomas,

I think this has highlighted a number of bugs with our signal handling code in general. It makes no sense to ever block these synchronous signals, so any code that does so seems flawed. You also point out the semantics of sa_sigaction.sa_mask which highlights that call_chained_handler is incorrect as it tries to emulate the conditions under which the original handler would have been called, but it explicitly sets a sigmask from sa_mask instead of combining it with the existing sigmask.

Yes, this coding may benefit from a bit more grooming (but preferably in another RFE).

Aside: I also noticed the SR_sigset is unused, so unclear if we have a bug lurking in PosixSignals::SR_initialize. There is some commentary about the blocked/unblocked status of SR_signum that also seems wrong.

That all said ... AFAICS the synchronous signals are all explicitly unblocked via PosixSignals::hotspot_sigmask which is set when a new thread starts executing, or an existing thread attaches. So from that point we should only need to ensure those signals are not explicitly blocked - which means that in os::signal we unblock them in sa_mask for all platforms. Does that not suffice? Or are you concerned about user native code changing the signal masks arbitrarily? In which case re-setting in each of the VM's signal handlers would be reasonable.

You are right. This is what I tried to say with "sa_sigaction.sa_mask vs pthread_sigmask".

I chose explicit unblocking over setting the signal mask when establishing the handler because I was concerned about someone changing the thread signal mask after handler is installed and before the signal is triggered. Especially I am not sure what happens if some native code calls sigprocmask() - does that affect all threads, only this thread? Posix says its undefined. Because I did not want to think about all this, and because I found it slightly more obvious, I unblock explicitly.

That said, it was a bit of a coin toss, and if you dislike this, we can do it via sigaction.sa_mask instead. The danger of someone blocking synchronous error signals under our nose is probably remote :)

Note that SIGTRAP is currently mostly only processed under #if defined(PPC64)and/or defined(AIX).

Yes, we use it for implicit null checks. But this has only meaning outside of signal handling. I unblocked it inside signal handlers because I found vague reports of processes crashing with blocked SIGTRAPs and thought that the same reasoning may apply to this signal wrt to blocking in signal handlers.

Blocking SIGTRAP has been handled inconsistently, too. It had been unblocked at the entrance of the secondary error handler for all platforms (see vmError_posix.cpp) but not in general error handling. I think that is an oversight.

Thanks,
David

Thanks, Thomas

gerard-ziemski · 2020-10-27T20:39:17Z

hi Thomas,

Looks good as far as I can tell.

I'm still learning the POSIX signal code, so excuse the question, but if our signal handlers are capable of handling the synchronous signals while doing fatal error handling, then why don't we unblock it for all the signals, so that we can handle any kind of a situation inside the fatal error handling?

Also, since we're touching the "VMError::reset_signal_handlers()" any chance we can improve on the name here, as you yourself suggested in other review?

cheers

tstuefe · 2020-10-28T06:36:10Z

hi Thomas,

Looks good as far as I can tell.

Thank you!

I'm still learning the POSIX signal code, so excuse the question, but if our signal handlers are capable of handling the synchronous signals while doing fatal error handling, then why don't we unblock it for all the signals, so that we can handle any kind of a situation inside the fatal error handling?

Its unnecessary and safer.

When signals which can be deferred (lets say, SIGPIPE or SIGCHILD) happen while they are blocked in the receiving thread's signal mask, nothing really happens: they are just added to an internal kernel queue for the time being, until the receiving thread unblocks, eg. by returning from signal handling.

Were they unblocked and happen to be delivered to the thread during fatal error handling - so, the reporting thread is somewhere inside VMError::report(), working on one of the error reporting steps - the error reporting step gets interrupted and we execute (in a new call frame):

crash_handler(SIGPIPE) ->
VMError::report_and_die(SIGPIPE) -> _first_error_tid == our tid ->
we print "[Error occurred during error reporting" into the hs-err file ->
re-enter VM::report() again.

Now call stack looks like this:

<OS signal handler 1>
javaSignalHandler & JVM_handle_xx_signal
VMError::report_and_die
VMError::report
<OS signal handler 2>
crash_handler
VMError::report_and_die
VMError::report

When re-entering VMError::report(), we skip the error reporting step during which we got the second signal, because we assume it was an error signal. We just continue with the next step.

The whole thing is quite ingenious in that it allows us to do all kind of unsafe things for the sake of error reporting. If one step fails, it does not endanger the next step. Also, we never unwind the stack, so when we dump the core after error handling, we have the full stack (plus 0-n incarnations of above frames). Unwinding the stack may also be dangerous since the stack may be corrupted.

The only caveat is that we artificially limit the number of recursive error signals that can happen to 30 (not sure why) and of course we may run out of stack after too many recursive errors.

Bottomline, that whole mechanism is written with synchronous error signals in mind, and we would have to rethink it if we allow all signals (eg it would not be necessary to skip an error reporting step because we get a SIGPIPE). Since blocking works fine for those signals, there is no need to do this.

Also, since we're touching the "VMError::reset_signal_handlers()" any chance we can improve on the name here, as you yourself suggested in other review?

Sure, but I'd like to do this in a separate patch. I was actually counting on you in that other patch :) but I have some vague cleanup in mind at some point, unless you are faster.

Cheers, Thomas

cheers

dholmes-ora

Hi Thomas,

I still feel that we should never explicitly block these error signals given it leads to undefined behaviour. So I'd like to see a change to os::signal at least in a future RFE.

For now I think all the signal handlers should explicitly unblock the error signals, which means the SR_handler also needs updating.

Some other minor changes requested below.

Thanks,
David

dholmes-ora · 2020-10-29T02:21:13Z

src/hotspot/os/posix/signals_posix.cpp

+  int rc = ::pthread_sigmask(SIG_UNBLOCK, &set, NULL);
+#ifdef ASSERT
+  if (rc != 0) {
+    log_warning(os)("pthread_sigmask failed (%d)", errno);


pthread_sigmask doesn't set errno, it returns the error code directly. But as there is only one defined error code for pthread_sigmask (EINVAL) this warning seem unnecessary. The usual assert_status would suffice.

Good catch.

I thought about assert too. assert is not a good idea, since we would trigger our assertion poison page, and since segv may still be blocked, this would cause exactly the problem I want to prevent.

But since the only error is EINVAL, and we know what we feed into this function, I'll just remove the assert.

src/hotspot/os/posix/vmError_posix.cpp

tstuefe · 2020-10-29T06:46:12Z

Hi Thomas,

I still feel that we should never explicitly block these error signals given it leads to undefined behaviour. So I'd like to see a change to os::signal at least in a future RFE.

Not a problem, that is reasonable. No need to make a new RFE. I will prepare a new version which excludes those signals from the sigaction sa_mask too, but leaves the unblocking in place. And I'll add the SR_handler. Since this will probably take some days to get tested again, I'll be happy if you review it after your vacation :)

Cheers, Thomas

For now I think all the signal handlers should explicitly unblock the error signals, which means the SR_handler also needs updating.

Some other minor changes requested below.

Thanks,
David

mlbridge · 2020-10-29T07:02:44Z

Mailing list message from David Holmes on hotspot-runtime-dev:

Hi Thomas,

On 29/10/2020 4:48 pm, Thomas Stuefe wrote:

On Thu, 29 Oct 2020 02:43:40 GMT, David Holmes <dholmes at openjdk.org> wrote:

Hi Thomas,

I still feel that we should never explicitly block these error signals given it leads to undefined behaviour. So I'd like to see a change to os::signal at least in a future RFE.

Not a problem, that is reasonable. No need to make a new RFE. I will prepare a new version which excludes those signals from the sigaction sa_mask too, but leaves the unblocking in place. And I'll add the SR_handler. Since this will probably take some days to get tested again, I'll be happy if you review it after your vacation :)

Okay.

Cheers, Thomas

For now I think all the signal handlers should explicitly unblock the error signals, which means the SR_handler also needs updating.

Some other minor changes requested below.

Thanks,
David

src/hotspot/os/posix/vmError_posix.cpp line 105:

103: static void crash_handler(int sig, siginfo_t* info, void* ucVoid) {
104:
105: PosixSignals::unblock_error_signals();

Just to be clear this crash_handler is only installed for the error signals, so then original code was redundantly processing "sig" twice?

I'm unsure what you mean. Do you mean, it had been added twice to newset?

sigaddset(&newset, sig); <<
// also unmask other synchronous signals
for (int i = 0; i < NUM_SIGNALS; i++) {
sigaddset(&newset, SIGNALS[i]); <<
}
PosixSignals::unblock_thread_signal_mask(&newset);

That is true. I believe the original coding was from us too, so I feel responsible :)

Right - if the only expected values for sig were in fact one of those
defined in SIGNALS then it gets added twice.

Cheers,
David
-----

tstuefe · 2020-10-29T07:38:05Z

Hi Thomas,

.....

Aside: I also noticed the SR_sigset is unused, so unclear if we have a bug lurking in PosixSignals::SR_initialize. There is some commentary about the blocked/unblocked status of SR_signum that also seems wrong.

I find the code in SR_initialize strange and very probably broken. I created https://bugs.openjdk.java.net/browse/JDK-8255577 to keep track of this. Its on my name but if you'd like to take this on be my guest.

tstuefe · 2020-10-29T11:59:04Z

Changed in the new version:

As David requested, we now exclude synchronous error signals from the signal masks used when registering a signal handler (sa_action.sa_mask). This means that unblocking them explicitly inside the signal handlers is only an additional safety measure now (the signal mask at the entrance of a signal handler is the result of an & operation between sa_action.sa_mask and whatever mask was in effect when the signal happened).
As David requested, now the SR_handler unblocks error signals too, as well as the UserHandler.
I wrote a big comment explaining what I do and why I do it.

tstuefe · 2020-11-02T10:14:07Z

Tests ran in our nightlies on all platforms successfully, including ppc, s390, aarch64.

dholmes-ora

Hi Thomas,

A couple of minor nits but otherwise seems good to go.

We should share these error signals between signals_posix and vmError_posix at some stage.

Thanks,
David

dholmes-ora · 2020-11-03T06:14:34Z

src/hotspot/os/posix/signals_posix.cpp

+//  absolutely sure they won't happen. In practice, this is always.
+//
+// Relevant Posix quote:
+// <quote>


This line and </quote> are not needed

dholmes-ora · 2020-11-03T06:15:33Z

src/hotspot/os/posix/signals_posix.cpp

+// </quote>
+//
+// We also include SIGTRAP in that list of never-to-block-signals. While not
+//  mentioned by the Posix documentation, in our (SAPs) experience blocking it


Seems to be an extra space after // on this and next couple of line.

openjdk · 2020-11-03T06:22:58Z

@tstuefe This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8252533: Signal handlers should run with synchronous error signals unblocked

Reviewed-by: gziemski, dholmes

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 174 new commits pushed to the master branch:

6d36b4b: 8255743: Relax SIGFPE match in in runtime/ErrorHandling/SecondaryErrorTest.java
f0eeca9: 8255718: Zero: VM should know it runs in interpreter-only mode
9beb866: 8233561: [TESTBUG] Swing text test bug8014863.java fails on macos
fe4e6b3: 8196089: javax/swing/Action/8133039/bug8133039.java fails
50357d1: 8254723: add diagnostic command to write Linux perf map file
f97ec35: 8255785: X11 libraries should not be required by configure for headless only
184db64: 8255732: OpenJDK fails to build if $A is set to a value with spaces
c774741: 8255695: Some JVMTI calls in the jdwp debug agent are using FUNC_PTR instead of JVMTI_FUNC_PTR
bee864f: 8255766: Fix linux+arm64 build after 8254072
ceba2f8: 8255696: JDWP debug agent's canSuspendResumeThreadLists() should be removed
... and 164 more: https://git.openjdk.java.net/jdk/compare/a0b687bfb279761b79d3fbad8f9dfae9b12c225e...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

tstuefe · 2020-11-03T06:41:13Z

Hi Thomas,

A couple of minor nits but otherwise seems good to go.

We should share these error signals between signals_posix and vmError_posix at some stage.

I agree.

Thanks,
David

Thanks a lot, David!

I'll fix those last nits and then I'll push.

tstuefe · 2020-11-03T07:13:23Z

/integrate

openjdk · 2020-11-03T07:20:25Z

@tstuefe Since your change was applied there have been 174 commits pushed to the master branch:

6d36b4b: 8255743: Relax SIGFPE match in in runtime/ErrorHandling/SecondaryErrorTest.java
f0eeca9: 8255718: Zero: VM should know it runs in interpreter-only mode
9beb866: 8233561: [TESTBUG] Swing text test bug8014863.java fails on macos
fe4e6b3: 8196089: javax/swing/Action/8133039/bug8133039.java fails
50357d1: 8254723: add diagnostic command to write Linux perf map file
f97ec35: 8255785: X11 libraries should not be required by configure for headless only
184db64: 8255732: OpenJDK fails to build if $A is set to a value with spaces
c774741: 8255695: Some JVMTI calls in the jdwp debug agent are using FUNC_PTR instead of JVMTI_FUNC_PTR
bee864f: 8255766: Fix linux+arm64 build after 8254072
ceba2f8: 8255696: JDWP debug agent's canSuspendResumeThreadLists() should be removed
... and 164 more: https://git.openjdk.java.net/jdk/compare/a0b687bfb279761b79d3fbad8f9dfae9b12c225e...master

Your commit was automatically rebased without conflicts.

Pushed as commit e7a2d5c.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

…blocked Reviewed-by: gziemski, dholmes

Initial patch

eabe3e3

openjdk bot added the hotspot-runtime [email protected] label Oct 23, 2020

tstuefe marked this pull request as ready for review October 23, 2020 19:26

openjdk bot added the rfr Pull request is ready for review label Oct 23, 2020

gerard-ziemski approved these changes Oct 28, 2020

View reviewed changes

dholmes-ora suggested changes Oct 29, 2020

View reviewed changes

tstuefe marked this pull request as draft October 29, 2020 08:48

Feedback David

8d86e25

openjdk bot removed the rfr Pull request is ready for review label Oct 29, 2020

tstuefe marked this pull request as ready for review October 29, 2020 11:54

openjdk bot added the rfr Pull request is ready for review label Oct 29, 2020

dholmes-ora approved these changes Nov 3, 2020

View reviewed changes

openjdk bot added the ready Pull request is ready to be integrated label Nov 3, 2020

last nits (comments only) ironed out

d9cbdb6

openjdk bot closed this Nov 3, 2020

openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Nov 3, 2020

openjdk-notifier bot referenced this pull request Nov 3, 2020

8252533: Signal handlers should run with synchronous error signals un…

e7a2d5c

…blocked Reviewed-by: gziemski, dholmes

tstuefe deleted the JDK-8252533-signal-handler-should-not-block-synchronous-error-signals branch November 4, 2020 05:09

JDK-8252533: Signal handlers should run with synchronous error signals unblocked #839

JDK-8252533: Signal handlers should run with synchronous error signals unblocked #839

Uh oh!

Conversation

tstuefe commented Oct 23, 2020 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Testing

Issue

Reviewers

Download

Uh oh!

bridgekeeper bot commented Oct 23, 2020

Uh oh!

openjdk bot commented Oct 23, 2020

Uh oh!

mlbridge bot commented Oct 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Webrevs

Uh oh!

dholmes-ora commented Oct 26, 2020

Uh oh!

tstuefe commented Oct 26, 2020

Uh oh!

gerard-ziemski commented Oct 27, 2020

Uh oh!

tstuefe commented Oct 28, 2020

Uh oh!

dholmes-ora left a comment

Choose a reason for hiding this comment

Uh oh!

dholmes-ora Oct 29, 2020

Choose a reason for hiding this comment

Uh oh!

tstuefe Oct 29, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

tstuefe commented Oct 29, 2020

Uh oh!

mlbridge bot commented Oct 29, 2020

Uh oh!

tstuefe commented Oct 29, 2020

Uh oh!

tstuefe commented Oct 29, 2020

Uh oh!

tstuefe commented Nov 2, 2020

Uh oh!

dholmes-ora left a comment

Choose a reason for hiding this comment

Uh oh!

dholmes-ora Nov 3, 2020

Choose a reason for hiding this comment

Uh oh!

dholmes-ora Nov 3, 2020

Choose a reason for hiding this comment

Uh oh!

openjdk bot commented Nov 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tstuefe commented Nov 3, 2020

Uh oh!

tstuefe commented Nov 3, 2020

Uh oh!

openjdk bot commented Nov 3, 2020

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

tstuefe commented Oct 23, 2020 •

edited by openjdk bot

Loading

mlbridge bot commented Oct 23, 2020 •

edited

Loading

openjdk bot commented Nov 3, 2020 •

edited

Loading