-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add test for dying with dignity #28987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I have long wanted an actual test that dying with dignity works. It is tricky because if dying with dignity works, it means the test JVM dies which is usually an abnormal condition. And anyway, how does one force a fatal error to be thrown. I was motivated to investigate this again by the fact that I missed a backport to one branch leading to an issue where Elasticsearch would not successfully die with dignity. And now we have a solution: we install a plugin that throws an out of memory error when it receives a request. We hack the standalone test infrastructure to prevent this from failing the test. To do this, we bypass the security manager and remove the PID file for the node; this tricks the test infrastructure into thinking that it does not need to stop the node. We also bypass seccomp so that we can fork jps to make sure that Elasticsearch really died. And to be extra paranoid, we parse the logs of the dead Elasticsearch process to make sure it died with dignity. Never forget.
|
Pinging @elastic/es-core-infra |
hub-cap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I would like another human to review
martijnvg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice to see how it is still possible to test this kind of stuff!
Too bad the PR build failed, because a JRE is installed instead of a JDK :(
qa/die-with-dignity/build.gradle
Outdated
| classname 'org.elasticsearch.DieWithDignityPlugin' | ||
| } | ||
|
|
||
| integTestCluster { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this empty block needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 881a8d2.
| final int pid = Integer.parseInt(pidFileLines.get(0)); | ||
| Files.delete(pidFile); | ||
| final CountDownLatch latch = new CountDownLatch(1); | ||
| client().performRequestAsync("GET", "/_die_with_dignity", new ResponseListener() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not performRequest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a long story how it ended up this way that is not interesting, but now it is no longer needed. It's leftover from a previous iteration. I pushed 14c720e.
rjernst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test looks fine, I just have one comment about the name. I know we have used this phrase internally to mean "let the node die when an error occurs", but thus far we have only used it in PR titles, AFAIK. I think for actual code, we should use a more plain name that will not be lost in translation, or be only known to those that have been around working on ES in the 2.x-6.x range. I think something like "qa/error-exit" or "qa/exit-on-error" would be much more clear to anyone browsing through the qa tests who did not know this "insider" phrase.
|
I disagree, because we refer to this as "die with dignity" everywhere. It has appeared in release notes too. And if ever we were to write a blog post about why we do this, it would definitely refer to this as "dying with dignity". It's a fun name. |
I have long wanted an actual test that dying with dignity works. It is tricky because if dying with dignity works, it means the test JVM dies which is usually an abnormal condition. And anyway, how does one force a fatal error to be thrown. I was motivated to investigate this again by the fact that I missed a backport to one branch leading to an issue where Elasticsearch would not successfully die with dignity. And now we have a solution: we install a plugin that throws an out of memory error when it receives a request. We hack the standalone test infrastructure to prevent this from failing the test. To do this, we bypass the security manager and remove the PID file for the node; this tricks the test infrastructure into thinking that it does not need to stop the node. We also bypass seccomp so that we can fork jps to make sure that Elasticsearch really died. And to be extra paranoid, we parse the logs of the dead Elasticsearch process to make sure it died with dignity. Never forget.
I have long wanted an actual test that dying with dignity works. It is tricky because if dying with dignity works, it means the test JVM dies which is usually an abnormal condition. And anyway, how does one force a fatal error to be thrown. I was motivated to investigate this again by the fact that I missed a backport to one branch leading to an issue where Elasticsearch would not successfully die with dignity. And now we have a solution: we install a plugin that throws an out of memory error when it receives a request. We hack the standalone test infrastructure to prevent this from failing the test. To do this, we bypass the security manager and remove the PID file for the node; this tricks the test infrastructure into thinking that it does not need to stop the node. We also bypass seccomp so that we can fork jps to make sure that Elasticsearch really died. And to be extra paranoid, we parse the logs of the dead Elasticsearch process to make sure it died with dignity. Never forget.
Relates #19272