@davidkyle
Member

Rarely, the datafeed under test may detect that the job has failed and terminate itself. The job failure is an expected part of the test, but the test does not account for the datafeed stopping and therefore having no persistent task. If the datafeed has already stopped, don't set its state to stopping.

Details of how this happens are in #52608 (comment)

Closes #52608

@davidkyle davidkyle added >test Issues or PRs that are addressing/adding tests :ml Machine learning v8.0.0 v7.7.0 v7.6.1 labels Feb 21, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

The problem with this is that if task == null then the test isn't doing what it's supposed to be doing, which is testing that you can stop an unassigned stopping datafeed.

Since this happens so rarely we could put the whole test in a loop and retry a few times if this situation occurs.
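A minimal sketch of the retry idea, assuming the setup can report whether it produced the required conditions. `RetrySetupSketch`, `firstUsableAttempt` and the `BooleanSupplier` are hypothetical names for illustration only and are not part of the real test:

```java
import java.util.function.BooleanSupplier;

final class RetrySetupSketch {
    // Runs the (hypothetical) setup up to maxAttempts times and returns the
    // 1-based attempt on which it created the required conditions, or -1 if
    // it never did.
    static int firstUsableAttempt(int maxAttempts, BooleanSupplier setupCreatedConditions) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (setupCreatedConditions.getAsBoolean()) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simulated setup: the rare bad outcome on the first call, fine afterwards.
        int[] calls = {0};
        BooleanSupplier setup = () -> ++calls[0] > 1;
        System.out.println(firstUsableAttempt(3, setup));
    }
}
```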

But if you don't want to do that then it would be better to use assumeFalse("Test setup did not create the required conditions", task == null); because then at least the test will be reported as having been skipped rather than silently succeeding when it didn't test what it was supposed to test.
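A rough sketch of how the assumption changes the reported outcome. To stay runnable without JUnit on the classpath it uses a local stand-in that mirrors the shape of JUnit 4's `Assume.assumeFalse(String, boolean)`; the `AssumptionViolated` class, the toy `run` method and the `Outcome` enum are assumptions made for illustration, not code from the test:

```java
// Stand-in for the exception JUnit throws when an assumption does not hold.
final class AssumptionViolated extends RuntimeException {
    AssumptionViolated(String message) {
        super(message);
    }
}

final class AssumeSketch {
    enum Outcome { PASSED, SKIPPED }

    // Same shape as org.junit.Assume.assumeFalse(String, boolean):
    // aborts the test when the condition is true.
    static void assumeFalse(String message, boolean condition) {
        if (condition) {
            throw new AssumptionViolated(message);
        }
    }

    // Hypothetical test body. `task` stands in for the datafeed's persistent
    // task, which is null when the datafeed has already stopped itself.
    static void testBody(Object task) {
        assumeFalse("Test setup did not create the required conditions", task == null);
        // ... the real test would now set the datafeed state to stopping and stop it ...
    }

    // Toy runner: a violated assumption is reported as a skip, not a pass.
    static Outcome run(Object task) {
        try {
            testBody(task);
            return Outcome.PASSED;
        } catch (AssumptionViolated e) {
            return Outcome.SKIPPED;
        }
    }

    public static void main(String[] args) {
        System.out.println(run(new Object()));
        System.out.println(run(null));
    }
}
```

With a plain `if (task != null)` guard the second case would be reported as passed, which is the silent success described above.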

Member Author

I think assumeFalse is best; a retry adds complexity to an already complicated piece of code. The test could fail for any number of reasons, including general infrastructure problems, so retrying would add noise.

Looking at the test, the setup waits for the datafeed to complete before closing the job, which should have prevented a flush from happening after the job failed.

https://github.com/elastic/elasticsearch/blob/071b60db2941e54c19446403d4fcd49d9d3a4f9f/x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/integration/MlDistributedFailureIT.java#L215

waitForDatafeed waits for the data counts to be indexed, which the datafeed does asynchronously before flushing the job. It is amazing that the race between writing the data counts and calling flush is sometimes lost to a thread doing an assertBusy, searching the index for the new data counts and then failing the job, but that is what's happening here. This is so unlikely (this is the first time it's been reported) that it's not worth rewriting the test.

@droberts195 droberts195 left a comment

LGTM

@davidkyle davidkyle merged commit 59944a0 into elastic:master Feb 25, 2020
davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Feb 25, 2020
@davidkyle davidkyle removed the v7.6.1 label Feb 26, 2020
droberts195 pushed a commit to droberts195/elasticsearch that referenced this pull request Mar 4, 2020
The assumption added in elastic#52631 skips a problematic test
if it fails to create the required conditions for the
scenario it is supposed to be testing.  (This happens
very rarely.)

However, before skipping the test it needs to remove the
failed job it has created because the standard test
cleanup code treats failed jobs as fatal errors.

Closes elastic#52608
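The ordering described in this commit message might look roughly like the sketch below. Everything here is a stand-in: `SkipSignal` plays the role of a violated test assumption, and `failJob`, `forceCloseJob` and `standardCleanup` are hypothetical helpers, not the real test utilities:

```java
import java.util.HashSet;
import java.util.Set;

// Stand-in for the exception thrown when a test assumption does not hold.
final class SkipSignal extends RuntimeException {
    SkipSignal(String message) {
        super(message);
    }
}

final class CleanupBeforeSkipSketch {
    // Hypothetical stand-in for the set of jobs currently in the failed state.
    private final Set<String> failedJobs = new HashSet<>();

    void failJob(String jobId) {
        failedJobs.add(jobId);
    }

    // Hypothetical helper that removes a failed job.
    void forceCloseJob(String jobId) {
        failedJobs.remove(jobId);
    }

    // Stand-in for the standard cleanup, which treats leftover failed jobs as fatal.
    void standardCleanup() {
        if (!failedJobs.isEmpty()) {
            throw new IllegalStateException("failed jobs left behind: " + failedJobs);
        }
    }

    void testBody(String jobId, Object task) {
        failJob(jobId); // the job failure is an expected part of the test
        if (task == null) {
            // Remove the failed job first, then skip.
            forceCloseJob(jobId);
            throw new SkipSignal("Test setup did not create the required conditions");
        }
        // ... the scenario under test runs here; this sketch just closes the job ...
        forceCloseJob(jobId);
    }

    // Toy runner: runs the body, then always runs the standard cleanup.
    static String run(Object task) {
        CleanupBeforeSkipSketch sketch = new CleanupBeforeSkipSketch();
        String outcome;
        try {
            sketch.testBody("job-1", task);
            outcome = "passed";
        } catch (SkipSignal e) {
            outcome = "skipped";
        }
        sketch.standardCleanup();
        return outcome;
    }

    public static void main(String[] args) {
        System.out.println(run(null));
        System.out.println(run(new Object()));
    }
}
```

If the skip were raised before `forceCloseJob`, `standardCleanup` in this sketch would throw, which is the kind of failure the commit is meant to avoid.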
droberts195 pushed a commit that referenced this pull request Mar 5, 2020
droberts195 pushed a commit that referenced this pull request Mar 5, 2020
@davidkyle davidkyle deleted the ml-dist-failure branch June 2, 2020 08:59
Development

Successfully merging this pull request may close these issues.

[CI] MlDistributedFailureIT.testCloseUnassignedFailedJobAndStopUnassignedStoppingDatafeed failed with NPE
