[ML] Handle failed datafeed in MlDistributedFailureIT #52631

davidkyle · 2020-02-21T10:45:58Z

On rare occasions the datafeed under test detects that the job has failed and terminates itself. The job failure is an expected part of the test, but the test does not account for the datafeed stopping and therefore having no persistent task. If the datafeed has already stopped, don't set its state to stopping.

Details of how this happens are in #52608 (comment)

Closes #52608

elasticmachine · 2020-02-21T10:46:01Z

Pinging @elastic/ml-core (:ml)

droberts195 · 2020-02-21T11:52:34Z

...k/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/integration/MlDistributedFailureIT.java

The problem with this is that if task == null then the test isn't doing what it's supposed to be doing, which is testing that you can stop an unassigned stopping datafeed.

Since this happens so rarely we could put the whole test in a loop and retry a few times if this situation occurs.

But if you don't want to do that then it would be better to use assumeFalse("Test setup did not create the required conditions", task == null); because then at least the test will be reported as having been skipped rather than silently succeeding when it didn't test what it was supposed to test.
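To illustrate the difference between a null check and an assumption, here is a minimal, framework-free sketch. The `assumeFalse` helper, the `AssumptionViolated` exception and the `run` method are hypothetical stand-ins written for this example; they only mimic how a test framework reports a failed assumption and are not the code from this PR.

```java
public class AssumeSketch {

    /** Stand-in for the exception a test framework throws when an assumption fails. */
    static class AssumptionViolated extends RuntimeException {
        AssumptionViolated(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for a framework's assumeFalse(String, boolean). */
    static void assumeFalse(String message, boolean condition) {
        if (condition) {
            throw new AssumptionViolated(message);
        }
    }

    /** Test body guarded by a null check: returns normally even when nothing was tested. */
    static void guardedBody(Object task) {
        if (task != null) {
            // ... mark the datafeed as stopping and exercise the stop path ...
        }
    }

    /** Test body guarded by an assumption: aborts as "skipped" when the task is missing. */
    static void assumedBody(Object task) {
        assumeFalse("Test setup did not create the required conditions", task == null);
        // ... mark the datafeed as stopping and exercise the stop path ...
    }

    /** Minimal runner that maps the outcome of a test body to a report status. */
    static String run(Runnable body) {
        try {
            body.run();
            return "PASSED";
        } catch (AssumptionViolated e) {
            return "SKIPPED";
        }
    }

    public static void main(String[] args) {
        System.out.println(run(() -> guardedBody(null)));          // PASSED, although nothing was tested
        System.out.println(run(() -> assumedBody(null)));          // SKIPPED
        System.out.println(run(() -> assumedBody(new Object())));  // PASSED
    }
}
```

With the null check, a run in which the persistent task is missing is indistinguishable from a real pass; with the assumption, it shows up as skipped in the report.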

I think assumeFalse is best; retrying adds complexity to an already complicated piece of code. The test could fail for any number of reasons, including general infrastructure problems, and retrying would add noise.

Looking at the test, the setup waits for the datafeed to complete before closing the job, which should have prevented a flush from happening after the job failed.

https://github.com/elastic/elasticsearch/blob/071b60db2941e54c19446403d4fcd49d9d3a4f9f/x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/integration/MlDistributedFailureIT.java#L215

waitForDatafeed waits for the data counts to be indexed, which the datafeed does asynchronously before flushing the job. It is amazing that the race between writing the data counts and calling flush is sometimes lost to a thread doing an assertBusy that searches the index for the new data counts and then fails the job, but that is what's happening here. This is so unlikely (it's the first time it has been reported) that it's not worth rewriting the test.
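For comparison, the retry option that was considered and rejected above could look roughly like the following. This is only a sketch under assumptions: `retrySetup` and the `BooleanSupplier` standing in for "setup created the required conditions" are hypothetical names, and a real test would also need to tear down the partially created job and datafeed between attempts, which is part of the added complexity.

```java
import java.util.function.BooleanSupplier;

public class RetrySetupSketch {

    /**
     * Runs the setup up to maxAttempts times. Returns the 1-based attempt on which
     * the setup produced the required conditions, or -1 if it never did.
     */
    static int retrySetup(int maxAttempts, BooleanSupplier setupCreatedConditions) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (setupCreatedConditions.getAsBoolean()) {
                return attempt;
            }
            // A real test would have to clean up the failed job and datafeed here
            // before trying again (assumption, not shown).
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated setup that loses the race once and succeeds on the second attempt.
        int attempt = retrySetup(3, () -> ++calls[0] >= 2);
        System.out.println("conditions met on attempt " + attempt);
    }
}
```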

droberts195

LGTM

The assumption added in elastic#52631 skips a problematic test if it fails to create the required conditions for the scenario it is supposed to be testing. (This happens very rarely.) However, before skipping the test it needs to remove the failed job it has created because the standard test cleanup code treats failed jobs as fatal errors. Closes elastic#52608
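The ordering described here (close the failed job first, then skip) can be sketched without a cluster. Everything below is a hypothetical in-memory model: `failedJobs`, `forceCloseJob`, `standardCleanup` and the job ID `job-1` are invented for illustration and are not the Elasticsearch test APIs. The normal path closing the job itself is also an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.Set;

public class CleanupBeforeSkipSketch {

    /** Stand-in for the exception thrown when a test assumption fails. */
    static class AssumptionViolated extends RuntimeException {
        AssumptionViolated(String message) {
            super(message);
        }
    }

    /** Hypothetical in-memory stand-in for the set of failed jobs in the cluster. */
    static final Set<String> failedJobs = new HashSet<>();

    static void forceCloseJob(String jobId) {
        failedJobs.remove(jobId);
    }

    /** Stand-in for the standard cleanup, which treats a leftover failed job as fatal. */
    static void standardCleanup() {
        if (!failedJobs.isEmpty()) {
            throw new IllegalStateException("failed jobs left behind: " + failedJobs);
        }
    }

    static String runScenario(boolean taskMissing, boolean closeBeforeSkip) {
        failedJobs.clear();
        failedJobs.add("job-1"); // the test deliberately fails the job
        String status;
        try {
            if (taskMissing) {
                if (closeBeforeSkip) {
                    forceCloseJob("job-1"); // remove the failed job before skipping
                }
                throw new AssumptionViolated("Test setup did not create the required conditions");
            }
            // Normal path (assumed): the test runs and closes the job itself.
            forceCloseJob("job-1");
            status = "PASSED";
        } catch (AssumptionViolated e) {
            status = "SKIPPED";
        }
        try {
            standardCleanup();
        } catch (IllegalStateException e) {
            status = "ERROR"; // cleanup turns the skip into a failure
        }
        return status;
    }

    public static void main(String[] args) {
        System.out.println(runScenario(true, false)); // ERROR
        System.out.println(runScenario(true, true));  // SKIPPED
        System.out.println(runScenario(false, true)); // PASSED
    }
}
```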

davidkyle added >test Issues or PRs that are addressing/adding tests :ml Machine learning v8.0.0 v7.7.0 v7.6.1 labels Feb 21, 2020

droberts195 reviewed Feb 21, 2020

davidkyle added 2 commits February 24, 2020 10:56

Handle case where datafeed has stopped

08ec002

assume false

f61e53e

davidkyle force-pushed the ml-dist-failure branch from 071b60d to f61e53e Compare February 24, 2020 10:59

droberts195 approved these changes Feb 25, 2020

davidkyle merged commit 59944a0 into elastic:master Feb 25, 2020

davidkyle mentioned this pull request Feb 25, 2020

[7.x] [ML] Handle failed datafeed in MlDistributedFailureIT (#52631) #52789

Merged

davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Feb 25, 2020

[ML] Handle failed datafeed in MlDistributedFailureIT (elastic#52631)

0fc7e7d

davidkyle removed the v7.6.1 label Feb 26, 2020

davidkyle added a commit that referenced this pull request Feb 26, 2020

[ML] Handle failed datafeed in MlDistributedFailureIT (#52631) (#52789)

37be695

droberts195 mentioned this pull request Mar 4, 2020

[TEST] Force close failed job before skipping test #53128

Merged

davidkyle deleted the ml-dist-failure branch June 2, 2020 08:59

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
