[ML] Fix minor race condition in dataframe analytics _stop #53029
Conversation
Pinging @elastic/ml-core (:ml)
    List<String> unavailableIndices =
-       verifyIndicesPrimaryShardsAreActive(clusterState, resolver, AnomalyDetectorsIndex.configIndexName());
+       verifyIndicesPrimaryShardsAreActive(clusterState,
Should we wait the same way for .ml-state*?
    verifyIndicesPrimaryShardsAreActive(clusterState,
        resolver,
        AnomalyDetectorsIndex.configIndexName(),
        MlStatsIndex.indexPattern());
I feel like I'm missing something obvious here. The .ml-stats index is created by DataFrameAnalyticsManager.execute, called from TaskExecutor.nodeOperation. But TaskExecutor.getAssignment, which runs first, expects the index to exist, and as far as I can tell it hasn't been created yet. Could that be because of the leniency in verifyIndicesPrimaryShardsAreActive?
resolver.concreteIndexNames(clusterState, IndicesOptions.lenientExpandOpen(), indexNames);
Yes, arguably we could prevent assignment if the required indices exist but are not usable at the time an assignment decision is made. But the changes currently in this PR will not fix #53007.
we could prevent assignment if the required indices exist but are not usable at the time an assignment decision is made
A side effect of this would be that the start datafeed API would fail in some cases where if it had waited a fraction of a second it would have succeeded. I am not sure that's a good idea. It certainly deserves a discussion.
For Transforms we get the stats directly from the task if it is running; otherwise the index is searched. Could we not do the same here? In which case .ml-stats would not be touched until the DFA has finished.
I think for anomaly detection we create most of the indices on node start (apart from the results index), which is one way of solving the problem, but it is heavy-handed and still requires checks that the index is active.
There is a scenario y'all are not thinking about.
What if the previously created index is just not available due to the nodes being down?
It's fine being lenient in verifyIndicesPrimaryShardsAreActive if the index does not exist at all, because we will create it.
But, if it already exists and its primary shard is not assigned, that is a bigger issue, one warranting the task not being assigned to a node.
For Transforms we get the stats directly from the task if it is running; otherwise the index is searched. Could we not do the same here?
We could. I can look into that.
Regardless of waiting for the indices to be created before returning from _start, I think the changes in verifyIndicesPrimaryShardsAreActive are necessary and good.
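A minimal standalone sketch of the check being discussed, assuming a simplified cluster-state model; the class, helper, and example index names are illustrative, not the real Elasticsearch code. Missing indices are skipped (the task will create them), while an existing index with an inactive primary is reported so assignment can be withheld.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Standalone sketch: types and names are hypothetical, not the real Elasticsearch classes.
public class PrimaryShardCheckSketch {

    /**
     * Returns the subset of requested indices that exist but whose primary shard
     * is not active. Indices that do not exist yet are skipped (lenient), because
     * the task will create them itself; an existing index with an unassigned
     * primary should block assignment.
     */
    static List<String> unavailableIndices(Map<String, Boolean> primaryActiveByIndex,
                                           Set<String> existingIndices,
                                           List<String> requestedIndices) {
        List<String> unavailable = new ArrayList<>();
        for (String index : requestedIndices) {
            if (existingIndices.contains(index) == false) {
                continue; // missing index: be lenient, it will be created later
            }
            if (Boolean.TRUE.equals(primaryActiveByIndex.get(index)) == false) {
                unavailable.add(index); // exists but primary not assigned: do not assign the task
            }
        }
        return unavailable;
    }

    public static void main(String[] args) {
        List<String> unavailable = unavailableIndices(
            Map.of(".ml-config", true, ".ml-stats-000001", false),
            Set.of(".ml-config", ".ml-stats-000001"),           // .ml-state* not created yet
            List.of(".ml-config", ".ml-stats-000001", ".ml-state-write"));
        System.out.println("Unavailable: " + unavailable);       // prints [.ml-stats-000001]
    }
}
```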
What if the previously created index is just not available due to the nodes being down?
This is what I was referring to in #53029 (comment)
So it solves one problem but creates another:
A side effect of this would be that the start datafeed API would fail in some cases where if it had waited a fraction of a second it would have succeeded. I am not sure that's a good idea. It certainly deserves a discussion.
I think if we do this we should do it consistently for anomaly detection and data frame analytics, so that both or neither assign tasks when the necessary indices exist but are not available.
@dimitris-athanasiou I think you originally added some of these checks. Do you have any extra insights?
A side effect of this would be that the start datafeed API would fail in some cases where if it had waited a fraction of a second it would have succeeded.
It won't fail. It just won't get assigned to a node, which means that it will attempt to get assigned again in a little bit.
This is exactly what we do for anomaly detectors. We verify primary shards are active for the necessary indices, even for ones created by the task. If the index does not exist, we don't bother checking for the primary shard. If the index does exist but the shard is not allocated, then we don't assign the task to a node. The task attempts to assign itself again later.
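As a rough, self-contained illustration of that assignment behaviour (the `Assignment` type and method names below are simplified stand-ins, not the real persistent tasks API): when required indices have inactive primaries, the executor returns an assignment with no node and an explanation, and the task is re-evaluated on later cluster state updates rather than failing.

```java
import java.util.List;

// Standalone sketch of the assignment decision described above; types are
// simplified stand-ins for the persistent tasks framework, not the real classes.
public class AssignmentSketch {

    record Assignment(String nodeId, String explanation) {
        static Assignment awaitingShards(List<String> indices) {
            // No node id: the task stays unassigned and is re-evaluated on the
            // next cluster state change instead of failing outright.
            return new Assignment(null, "primary shards not active for " + indices);
        }
    }

    static Assignment getAssignment(List<String> unavailableIndices, String candidateNode) {
        if (unavailableIndices.isEmpty() == false) {
            return Assignment.awaitingShards(unavailableIndices);
        }
        return new Assignment(candidateNode, "assigned");
    }

    public static void main(String[] args) {
        System.out.println(getAssignment(List.of(".ml-state-write"), "node-1")); // unassigned, with reason
        System.out.println(getAssignment(List.of(), "node-1"));                  // assigned to node-1
    }
}
```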
For Transforms we get the stats directly from the task if it is running; otherwise the index is searched.
We already get the stats for running tasks from in-memory state.
@dimitris-athanasiou I think you originally added some of these checks. Do you have any extra insights?
I think not checking whether the state and stats indices have their primaries active there is an omission. But it is not the cause of this error.
@droberts195, @benwtrent and I discussed this further and decided we will need to add debugging in order to understand what exactly is going on, as it is currently not perfectly clear.
changes in verifyIndicesPrimaryShardsAreActive are necessary and good.
Yes, I agree. This is what we do for AD and we should check the same indices here.
…ame-fix-start-race-condition
@elasticmachine update branch
dimitris-athanasiou
left a comment
LGTM
droberts195
left a comment
LGTM2
run elasticsearch-ci/2
…3029) Tests have been periodically failing due to a race condition when checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already been transitioned to `STARTED`. This allows the `_start` API call to return. But, if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task be removed BEFORE the `.ml-state|stats` indices are created/updated. This change moves towards the task cleaning itself up in its main execution thread. `stop` flips the task's `isStopping` flag, and now we check `isStopping` in every necessary method, allowing the task to stop gracefully. closes elastic#53007
…53164) Tests have been periodically failing due to a race condition when checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already been transitioned to `STARTED`. This allows the `_start` API call to return. But, if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task be removed BEFORE the `.ml-state|stats` indices are created/updated. This change moves towards the task cleaning itself up in its main execution thread. `stop` flips the task's `isStopping` flag, and now we check `isStopping` in every necessary method, allowing the task to stop gracefully. closes #53007
Thanks, Ben!
…cession quickly (#53192) A previous change (#53029) is causing analysis jobs to wait for certain indices to be made available. While it is good for jobs to wait, they could fail early on _start. This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available. If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled. For tasks seeking a new assignment after a node failure, that behavior is unchanged. closes #53188
…cession quickly (#53192) (#53332) A previous change (#53029) is causing analysis jobs to wait for certain indices to be made available. While it is good for jobs to wait, they could fail early on _start. This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available. If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled. For tasks seeking a new assignment after a node failure, that behavior is unchanged. closes #53188
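A small self-contained sketch of the retry-until-timeout behaviour described in that follow-up; the predicate shape and names are hypothetical, not the actual #53192 code. Assignment failures caused by unavailable shards keep the wait going until the timeout, after which _start is failed and the task canceled.

```java
import java.time.Duration;
import java.time.Instant;

// Standalone sketch of the wait-for-assignment predicate described above;
// names are illustrative, not the real Elasticsearch classes.
public class StartWaitPredicateSketch {

    enum WaitResult { KEEP_WAITING, ASSIGNED, FAIL_START }

    static WaitResult evaluate(String assignedNode,
                               boolean failureIsShardUnavailability,
                               Instant startedWaiting,
                               Duration timeout) {
        if (assignedNode != null) {
            return WaitResult.ASSIGNED;
        }
        boolean timedOut = Duration.between(startedWaiting, Instant.now()).compareTo(timeout) >= 0;
        if (failureIsShardUnavailability && timedOut == false) {
            // Shards may simply not be ready yet: keep retrying until the timeout.
            return WaitResult.KEEP_WAITING;
        }
        // Either a non-retryable failure or the timeout was reached:
        // fail _start and cancel the task.
        return WaitResult.FAIL_START;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.now().minusSeconds(5);
        System.out.println(evaluate(null, true, t0, Duration.ofSeconds(30)));      // KEEP_WAITING
        System.out.println(evaluate(null, true, t0, Duration.ofSeconds(1)));       // FAIL_START
        System.out.println(evaluate("node-1", false, t0, Duration.ofSeconds(30))); // ASSIGNED
    }
}
```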
Tests have been periodically failing due to a race condition when checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already been transitioned to `STARTED`. This allows the `_start` API call to return. But, if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task be removed BEFORE the `.ml-state|stats` indices are created/updated.
This change moves towards the task cleaning itself up in its main execution thread. `stop` flips the task's `isStopping` flag, and now we check `isStopping` in every necessary method, allowing the task to stop gracefully.
closes #53007
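A compact sketch of the `isStopping` pattern the description refers to, using hypothetical step names rather than the real DataFrameAnalyticsTask phases: `_stop` only flips a flag, and the task's own execution thread checks it between phases so it can clean itself up gracefully.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Standalone sketch of the isStopping pattern described above; the task and
// step names are illustrative, not the real data frame analytics code.
public class StoppableTaskSketch {

    private final AtomicBoolean isStopping = new AtomicBoolean(false);

    /** Called by the _stop handler: only flips the flag, never removes the task. */
    void stop() {
        isStopping.set(true);
    }

    void run() {
        if (isStopping.get()) { markStopped(); return; }
        reindexSourceData();

        if (isStopping.get()) { markStopped(); return; }
        runAnalytics();

        if (isStopping.get()) { markStopped(); return; }
        persistStateAndStats();

        markStopped();
    }

    // Hypothetical step methods standing in for the real phases of the task.
    private void reindexSourceData() { }
    private void runAnalytics() { }
    private void persistStateAndStats() { }

    /** The task completes and cleans up on its own execution thread. */
    private void markStopped() {
        System.out.println("task stopped cleanly");
    }

    public static void main(String[] args) {
        StoppableTaskSketch task = new StoppableTaskSketch();
        task.stop();  // simulate a _stop arriving immediately after _start
        task.run();   // the task notices isStopping before doing any work
    }
}
```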