[ML] Handle failed datafeed in MlDistributedFailureIT #52631
Conversation
Pinging @elastic/ml-core (:ml)
The problem with this is that if `task == null` then the test isn't doing what it's supposed to be doing, which is testing that you can stop an unassigned stopping datafeed.
Since this happens so rarely, we could put the whole test in a loop and retry a few times if this situation occurs.
But if you don't want to do that, it would be better to use `assumeFalse("Test setup did not create the required conditions", task == null);`, because then at least the test will be reported as skipped rather than silently succeeding when it didn't test what it was supposed to test.
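The suggested `assumeFalse` pattern can be sketched as follows. This is a minimal, self-contained illustration with hypothetical names; the real test would call JUnit's `Assume.assumeFalse(String, boolean)`, whose `AssumptionViolatedException` makes the runner report the test as skipped rather than passed.

```java
// Minimal sketch of the skip-if-setup-failed pattern (hypothetical names;
// a real test would use JUnit's Assume.assumeFalse instead).
public class AssumeSketch {

    static class AssumptionViolatedException extends RuntimeException {
        AssumptionViolatedException(String message) {
            super(message);
        }
    }

    // Mirrors the shape of Assume.assumeFalse(message, condition):
    // abort the test as "skipped" when the condition holds.
    static void assumeFalse(String message, boolean condition) {
        if (condition) {
            throw new AssumptionViolatedException(message);
        }
    }

    public static void main(String[] args) {
        Object task = null; // stands in for the missing persistent task
        try {
            assumeFalse("Test setup did not create the required conditions", task == null);
            System.out.println("test body ran");
        } catch (AssumptionViolatedException e) {
            System.out.println("skipped: " + e.getMessage());
        }
    }
}
```

The point of the pattern is visibility: a skipped test shows up in the test report, whereas an early `return` when `task == null` would count as a silent pass.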
I think `assumeFalse` is best; a retry adds complexity to an already complicated piece of code. The test could fail for any number of reasons, including general infrastructure problems, so retrying would add noise.
Looking at the test: in the setup it waits for the datafeed to complete before closing the job, which should have prevented the flush from happening after the job failed.
`waitForDatafeed` waits for the data counts to be indexed, which the datafeed does asynchronously before flushing the job. It is amazing that the race between writing the data counts and calling flush is sometimes lost to a thread doing an `assertBusy` search of the index for the new data counts and then failing the job, but that is what's happening here. This is so unlikely (it's the first time it has been reported) that it's not worth rewriting the test.
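The `assertBusy`-style wait mentioned above can be sketched as a polling loop. This is a simplified, self-contained stand-in (hypothetical names, not the actual Elasticsearch test framework code); the race arises because the polling thread can observe the newly indexed data counts and fail the job before the datafeed thread has called flush.

```java
import java.util.function.BooleanSupplier;

// Simplified sketch of an assertBusy-style wait: poll a condition until it
// holds or a timeout elapses. Hypothetical, not the real test utility.
public class AssertBusySketch {

    static void assertBusy(BooleanSupplier condition, long timeoutMillis) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return; // condition observed; the caller proceeds immediately
            }
            Thread.sleep(10); // back off briefly between polls
        }
        throw new AssertionError("condition not met within " + timeoutMillis + "ms");
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // Condition becomes true after ~50ms; assertBusy returns once it does.
        assertBusy(() -> System.currentTimeMillis() - start > 50, 1_000);
        System.out.println("condition met");
    }
}
```

Because `assertBusy` returns the moment the condition is observed, any work the observed thread does *after* satisfying the condition (here, the datafeed's flush) can still be in flight when the caller acts on the result.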
droberts195 left a comment
LGTM
The assumption added in elastic#52631 skips a problematic test if it fails to create the required conditions for the scenario it is supposed to be testing. (This happens very rarely.) However, before skipping the test it needs to remove the failed job it has created because the standard test cleanup code treats failed jobs as fatal errors. Closes elastic#52608
Rarely, the datafeed under test may detect that the job has failed and terminate itself. The job failure is an expected part of the test, but the test does not account for the datafeed stopping and having no persistent task. If the datafeed has stopped, then don't set its state to stopping.
Details of how this happens are in #52608 (comment)
Closes #52608
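The fix described above can be sketched as a simple guard: only move the datafeed to a stopping state if its persistent task still exists. All names here are illustrative, not the actual Elasticsearch APIs.

```java
// Hypothetical sketch of the guard: a datafeed whose persistent task has
// already gone away (because it terminated itself after the job failed)
// must not be marked as STOPPING.
public class DatafeedStopSketch {

    enum DatafeedState { STARTED, STOPPING, STOPPED }

    static DatafeedState nextState(Object persistentTask) {
        if (persistentTask == null) {
            // No persistent task: the datafeed already terminated itself,
            // so treat it as stopped rather than transitioning to STOPPING.
            return DatafeedState.STOPPED;
        }
        return DatafeedState.STOPPING;
    }

    public static void main(String[] args) {
        System.out.println(nextState(null));         // STOPPED
        System.out.println(nextState(new Object())); // STOPPING
    }
}
```

Without the null check, the test would attempt to set a stopping state on a task that no longer exists, which is the failure mode reported in #52608.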