@jasontedor
Member

I have long wanted an actual test that dying with dignity works. It is tricky because if dying with dignity works, the test JVM dies, which is usually an abnormal condition. And anyway, how does one force a fatal error to be thrown? I was motivated to investigate this again because I missed a backport to one branch, which led to an issue where Elasticsearch would not successfully die with dignity. And now we have a solution: we install a plugin that throws an out of memory error when it receives a request. We hack the standalone test infrastructure to prevent this from failing the test. To do this, we bypass the security manager and remove the PID file for the node; this tricks the test infrastructure into thinking that it does not need to stop the node. We also bypass seccomp so that we can fork jps to make sure that Elasticsearch really died. And to be extra paranoid, we parse the logs of the dead Elasticsearch process to make sure it died with dignity. Never forget.
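
The two verification steps described above (checking jps output for the node's PID, then scanning the dead node's log) could be sketched roughly as follows. This is only an illustration, not the code in this PR: the class name `DeathChecks`, both method names, and all sample data are hypothetical. The only assumptions about real tools are that jps prints one JVM per line as `<pid> <main class>` and that a thrown out of memory error shows up in a log as `java.lang.OutOfMemoryError`.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helpers for illustration; the names are not taken from the PR. */
final class DeathChecks {

    /**
     * Returns true if any line of jps output belongs to the given PID.
     * Assumes the usual jps layout: "<pid> <main class or jar>" per line.
     */
    static boolean isListedByJps(List<String> jpsOutput, long pid) {
        final String wanted = Long.toString(pid);
        for (String line : jpsOutput) {
            // Compare the whole first field so that 4242 does not match 14242.
            final String[] fields = line.trim().split("\\s+", 2);
            if (fields[0].equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if every required pattern is found in at least one log line. */
    static boolean logContainsAll(List<String> logLines, List<Pattern> required) {
        for (Pattern pattern : required) {
            boolean found = false;
            for (String line : logLines) {
                if (pattern.matcher(line).find()) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Sample data invented for this sketch; it is not real Elasticsearch output.
        final List<String> jpsAfterDeath = List.of("5151 Jps");
        final List<String> log = List.of(
            "placeholder line before the failure",
            "java.lang.OutOfMemoryError: placeholder message");
        final List<Pattern> required = List.of(Pattern.compile("java\\.lang\\.OutOfMemoryError"));

        System.out.println("node still listed: " + isListedByJps(jpsAfterDeath, 4242L));
        System.out.println("log has fatal error: " + logContainsAll(log, required));
    }
}
```

Keeping the parsing separate from the process fork means the checks can be exercised on canned output without a running node; the actual patterns to look for in the log depend on what Elasticsearch writes when it exits.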

Relates #19272

@jasontedor added the >test, review, v7.0.0, v6.3.0, and :Core/Infra/Resiliency labels on Mar 12, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

Contributor

@hub-cap hub-cap left a comment

LGTM but I would like another human to review

Member

@martijnvg martijnvg left a comment

LGTM, nice to see how it is still possible to test this kind of stuff!

Too bad the PR build failed, because a JRE is installed instead of a JDK :(
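
Since jps ships with a JDK and not with a JRE, a test like this could check for the tool up front. The sketch below is one hedged way to do that and is not part of this PR: `JpsLocator` and `findJps` are hypothetical names. It assumes that on JDK 9 and later `java.home` is the JDK root, while on a JDK 8 install `java.home` points at the `jre` subdirectory, so the parent directory is searched as well.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration; not taken from the PR. */
final class JpsLocator {

    /**
     * Looks for a jps executable under the given Java home.
     * Returns an empty Optional when none is found, which is what a JRE-only install looks like.
     */
    static Optional<Path> findJps(Path javaHome) {
        final List<Path> roots = new ArrayList<>();
        roots.add(javaHome);
        // On JDK 8, java.home is <jdk>/jre, so the tools live one level up.
        if (javaHome.getParent() != null) {
            roots.add(javaHome.getParent());
        }
        for (Path root : roots) {
            for (String name : List.of("jps", "jps.exe")) {
                final Path candidate = root.resolve("bin").resolve(name);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        final Path javaHome = Paths.get(System.getProperty("java.home"));
        System.out.println("jps found: " + findJps(javaHome).isPresent());
    }
}
```

A test could use the result to skip itself or to fail with a clear message, instead of failing when the fork of jps does not work.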

classname 'org.elasticsearch.DieWithDignityPlugin'
}

integTestCluster {
Member

is this empty block needed?

Member Author

I pushed 881a8d2.

final int pid = Integer.parseInt(pidFileLines.get(0));
Files.delete(pidFile);
final CountDownLatch latch = new CountDownLatch(1);
client().performRequestAsync("GET", "/_die_with_dignity", new ResponseListener() {
Member

Why not performRequest?

Member Author

How it ended up this way is a long and uninteresting story, but it is no longer needed; it is left over from a previous iteration. I pushed 14c720e.
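
For readers wondering why the async call plus latch was redundant: when the caller does nothing but wait on the latch, the pattern behaves like a blocking call. A minimal sketch of that equivalence follows, using a made-up `Listener` interface and `requestAsync` method as stand-ins for the client; none of these names are the Elasticsearch REST client API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for an async client; not the Elasticsearch REST client. */
final class BlockingVsAsync {

    /** Hypothetical callback interface playing the role of a response listener. */
    interface Listener {
        void onSuccess(String response);

        void onFailure(Exception e);
    }

    /** Hypothetical async call that completes the listener on another thread. */
    static void requestAsync(String path, Listener listener) {
        new Thread(() -> listener.onSuccess("response for " + path)).start();
    }

    /** The latch pattern: start the async call, then block until a callback fires. */
    static String requestBlocking(String path) throws Exception {
        final CountDownLatch latch = new CountDownLatch(1);
        final AtomicReference<String> response = new AtomicReference<>();
        final AtomicReference<Exception> failure = new AtomicReference<>();
        requestAsync(path, new Listener() {
            @Override
            public void onSuccess(String r) {
                response.set(r);
                latch.countDown();
            }

            @Override
            public void onFailure(Exception e) {
                failure.set(e);
                latch.countDown();
            }
        });
        if (!latch.await(10, TimeUnit.SECONDS)) {
            throw new TimeoutException("no callback within 10 seconds");
        }
        if (failure.get() != null) {
            throw failure.get();
        }
        return response.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(requestBlocking("/_example"));
    }
}
```

If a client already offers a blocking variant, calling it directly avoids the latch, the two references, and the timeout handling.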

Member

@rjernst rjernst left a comment

The test looks fine; I just have one comment about the name. I know we have used this phrase internally to mean "let the node die when an error occurs", but thus far we have only used it in PR titles, AFAIK. I think for actual code, we should use a plainer name that will not be lost in translation or be known only to those who have been working on ES in the 2.x-6.x range. I think something like "qa/error-exit" or "qa/exit-on-error" would be much clearer to anyone browsing through the qa tests who does not know this "insider" phrase.

@jasontedor
Member Author

I disagree, because we refer to this as "die with dignity" everywhere. It has appeared in release notes too. And if ever we were to write a blog post about why we do this, it would definitely refer to this as "dying with dignity". It's a fun name.

@jasontedor merged commit 8b6fbe2 into elastic:master on Mar 13, 2018
jasontedor added a commit that referenced this pull request Mar 13, 2018
@jasontedor deleted the never-forget branch on March 13, 2018 03:22