
Conversation

@benwtrent
Member

@benwtrent benwtrent commented Mar 2, 2020

Tests have been failing periodically due to a race condition when checking a recently STOPPED task's state. The .ml-state index is not created until the task has already transitioned to STARTED, which allows the _start API call to return. But if a user (or test) immediately attempts to _stop that job, the job could stop and the task could be removed BEFORE the .ml-state|stats indices are created/updated.

This change moves towards the task cleaning up after itself in its main execution thread. stop now just flips the task's isStopping flag, and isStopping is checked at every necessary method, allowing the task to stop gracefully.
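
For context, a minimal sketch of the cooperative-stop pattern described above, assuming a simple flag-and-check structure; aside from stop and isStopping, the class and method names are illustrative, not the actual Elasticsearch code:

    import java.util.concurrent.atomic.AtomicBoolean;

    /** Illustrative stand-in for the analytics task, not the real Elasticsearch class. */
    class AnalyticsTaskSketch {

        private final AtomicBoolean isStopping = new AtomicBoolean(false);

        /** Called by the _stop API: only flips the flag; cleanup happens on the task's own thread. */
        void stop() {
            isStopping.set(true);
        }

        boolean isStopping() {
            return isStopping.get();
        }

        /** Main execution thread: checks the flag before each phase and cleans up after itself. */
        void run() {
            if (isStopping()) {
                markCompleted();
                return;
            }
            createStateAndStatsIndices(); // .ml-state / .ml-stats are only touched from this thread

            if (isStopping()) {
                markCompleted();
                return;
            }
            runAnalysis();

            markCompleted(); // the task removes itself once its own work is done
        }

        private void createStateAndStatsIndices() { /* ... */ }

        private void runAnalysis() { /* ... */ }

        private void markCompleted() { /* ... */ }
    }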

closes #53007

@benwtrent benwtrent added >test Issues or PRs that are addressing/adding tests >refactoring :ml Machine learning v8.0.0 v7.7.0 labels Mar 2, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)


List<String> unavailableIndices =
    verifyIndicesPrimaryShardsAreActive(clusterState, resolver, AnomalyDetectorsIndex.configIndexName());
verifyIndicesPrimaryShardsAreActive(clusterState,
Contributor


Should we wait the same way for .ml-state*?

verifyIndicesPrimaryShardsAreActive(clusterState,
    resolver,
    AnomalyDetectorsIndex.configIndexName(),
    MlStatsIndex.indexPattern());
Member


I feel like I'm missing something obvious here. The .ml-stats index is created by DataFrameAnalyticsManager.execute, called from TaskExecutor.nodeOperation. But TaskExecutor.getAssignment, which runs first, expects the index to exist, and as far as I can tell it hasn't been created yet. Could that be because of the leniency in verifyIndicesPrimaryShardsAreActive?

resolver.concreteIndexNames(clusterState, IndicesOptions.lenientExpandOpen(), indexNames);
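
For readers following along, a rough sketch of why that leniency matters: with lenient expansion, a name that resolves to nothing is simply skipped, so a not-yet-created index never blocks the check, while an existing index with an unassigned primary does. This is plain Java over a toy cluster-state map, not the real verifyIndicesPrimaryShardsAreActive:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class PrimaryShardCheckSketch {

        /** Toy "cluster state": index name -> are its primaries active? A missing key means the index does not exist yet. */
        static List<String> unavailableIndices(Map<String, Boolean> primariesActiveByIndex, String... indexNames) {
            List<String> unavailable = new ArrayList<>();
            for (String name : indexNames) {
                Boolean primariesActive = primariesActiveByIndex.get(name);
                if (primariesActive == null) {
                    continue; // lenient: the index does not exist yet and will be created later, so it does not block
                }
                if (primariesActive == false) {
                    unavailable.add(name); // exists but primary not active: this should block assignment
                }
            }
            return unavailable;
        }

        public static void main(String[] args) {
            Map<String, Boolean> clusterState = Map.of(".ml-config", true, ".ml-stats-000001", false);
            // .ml-state not created yet -> skipped; .ml-stats-000001 exists with an inactive primary -> reported
            System.out.println(unavailableIndices(clusterState, ".ml-config", ".ml-state", ".ml-stats-000001"));
        }
    }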


Yes, arguably we could prevent assignment if the required indices exist but are not usable at the time an assignment decision is made. But the changes currently in this PR will not fix #53007.

we could prevent assignment if the required indices exist but are not usable at the time an assignment decision is made

A side effect of this would be that the start datafeed API would fail in some cases where, had it waited a fraction of a second, it would have succeeded. I am not sure that's a good idea. It certainly deserves a discussion.

Member


For Transforms we get the stats directly from the task if it is running else the index is searched. Could we not do the same here? In that case .ml-stats would not be touched until the DFA has finished.

I think for anomaly detection we created most of the indices on node start (apart from the results index), which is one way of solving the problem, but it is heavy-handed and still required checks that the index was active.
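
For readers unfamiliar with the Transforms pattern mentioned above, a hedged sketch of the idea (hypothetical names, not the real Transforms or ML code): prefer the in-memory stats of a running task and only search the stats index for tasks that have stopped.

    import java.util.Optional;

    class StatsLookupSketch {

        record Stats(long docsProcessed) {}

        interface RunningTask {
            Stats currentStats(); // stats held in memory while the task is running
        }

        interface StatsIndexClient {
            Stats searchLatestStats(String id); // hypothetical stand-in for a search against .ml-stats
        }

        /** Prefer the live task's in-memory stats; only hit the stats index for stopped tasks. */
        Stats getStats(Optional<RunningTask> runningTask, StatsIndexClient statsIndex, String id) {
            return runningTask
                .map(RunningTask::currentStats)
                .orElseGet(() -> statsIndex.searchLatestStats(id));
        }
    }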

Member Author


There is a scenario y'all are not thinking about.

What if the previously created index is just not available due to the nodes being down?

It's fine to be lenient in verifyIndicesPrimaryShardsAreActive if the index does not exist at all, because we will create it.

But if it already exists and its primary shard is not assigned, that is a bigger issue, one that warrants not assigning the task to a node.

For Transforms we get the stats directly from the task if it is running else the index is searched. Could we not do the same here?

We could. I can look into that.

Regardless of waiting for the indices to be created before returning from _start, I think the changes in verifyIndicesPrimaryShardsAreActive are necessary and good.


What if the previously created index is just not available due to the nodes being down?

This is what I was referring to in #53029 (comment)

So it solves one problem but creates another:

A side effect of this would be that the start datafeed API would fail in some cases where, had it waited a fraction of a second, it would have succeeded. I am not sure that's a good idea. It certainly deserves a discussion.

I think if we do this we should do it consistently for anomaly detection and data frame analytics, so that both or neither assign tasks when the necessary indices exist but are not available.

@dimitris-athanasiou I think you originally added some of these checks. Do you have any extra insights?

Member Author

@benwtrent benwtrent Mar 3, 2020


A side effect of this would be that the start datafeed API would fail in some cases where, had it waited a fraction of a second, it would have succeeded.

It won't fail; it just won't get assigned a node, which means that it will attempt to get assigned again in a little bit.

This is exactly what we do for anomaly detectors: we verify primary shards are active for the necessary indices, even for ones created by the task. If the index does not exist, we don't bother checking for the primary shard. If the index does exist but the shard is not allocated, then we don't assign the task to a node, and the task attempts to assign itself again later.
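
As a rough illustration of that assignment behaviour (illustrative stand-in types, not the real TaskExecutor): when a required index exists but its primary is unassigned, the assignment call returns a "no node" result with an explanation rather than failing, and the persistent task framework re-evaluates the assignment on later cluster-state changes.

    import java.util.List;

    class AssignmentSketch {

        /** Illustrative result: either a node id, or null plus an explanation meaning "not assigned yet". */
        record Assignment(String nodeId, String explanation) {
            static Assignment awaitingAssignment(String why) {
                return new Assignment(null, why);
            }
        }

        Assignment getAssignment(List<String> unavailableIndices, String candidateNode) {
            if (unavailableIndices.isEmpty() == false) {
                // Do not fail _start; just decline to assign for now. The master re-runs this check on
                // subsequent cluster-state updates, so the task picks up a node once primaries are active.
                return Assignment.awaitingAssignment("primary shards not active for " + unavailableIndices);
            }
            return new Assignment(candidateNode, "assigned");
        }
    }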

Contributor


For Transforms we get the stats directly from the task if it is running else the index is searched.

We already get the stats for running tasks from in-memory state.

@dimitris-athanasiou I think you originally added some of these checks. Do you have any extra insights?

I think not checking whether the state and stats indices have their primaries active there is an omission. But it is not the cause of this error.

@droberts195, @benwtrent and I discussed this further and decided we will need to add debugging in order to understand exactly what is going on, as it is currently not perfectly clear.

Member


changes in verifyIndicesPrimaryShardsAreActive are necessary and good.

Yes, I agree: this is what we do for AD, and we should check the same indices here.

@benwtrent benwtrent changed the title [ML] Fix minor race condition in dataframe analytics _start [ML] Fix minor race condition in dataframe analytics _stop Mar 4, 2020
@benwtrent
Member Author

@elasticmachine update branch

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment


LGTM


@droberts195 droberts195 left a comment


LGTM2

@benwtrent
Member Author

run elasticsearch-ci/2

@benwtrent benwtrent merged commit 932f81b into elastic:master Mar 5, 2020
@benwtrent benwtrent deleted the feature/ml-dataframe-fix-start-race-condition branch March 5, 2020 13:32
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Mar 5, 2020
…3029)

Tests have been failing periodically due to a race condition when checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already transitioned to `STARTED`, which allows the `_start` API call to return. But if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task could be removed BEFORE the `.ml-state|stats` indices are created/updated.

This change moves towards the task cleaning up after itself in its main execution thread. `stop` now just flips the task's `isStopping` flag, and `isStopping` is checked at every necessary method, allowing the task to stop gracefully.

closes elastic#53007
benwtrent added a commit that referenced this pull request Mar 5, 2020
…53164)

Tests have been failing periodically due to a race condition when checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already transitioned to `STARTED`, which allows the `_start` API call to return. But if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task could be removed BEFORE the `.ml-state|stats` indices are created/updated.

This change moves towards the task cleaning up after itself in its main execution thread. `stop` now just flips the task's `isStopping` flag, and `isStopping` is checked at every necessary method, allowing the task to stop gracefully.

closes #53007
@mark-vieira
Contributor

Thanks, Ben!

benwtrent added a commit that referenced this pull request Mar 10, 2020
…cession quickly (#53192)

A previous change (#53029) is causing analysis jobs to wait for certain indices to be made available. While it is good for jobs to wait, they could fail early on _start.

This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available.

If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled. 

For tasks seeking a new assignment after a node failure, that behavior is unchanged.


closes #53188
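
A hedged sketch of that wait-and-retry behaviour (all names illustrative, not the actual Elasticsearch predicate): "shards not yet available" keeps the wait loop going, a hard failure cancels the task immediately, and hitting the timeout while still waiting is treated as a _start failure.

    import java.time.Duration;
    import java.time.Instant;

    class StartTaskWaitSketch {

        enum TaskState { STARTED, AWAITING_SHARDS, FAILED }

        interface TaskStateSource {
            TaskState currentState();
        }

        /** The predicate only completes the wait on a terminal state; AWAITING_SHARDS keeps retrying. */
        static boolean waitIsOver(TaskState state) {
            return state == TaskState.STARTED || state == TaskState.FAILED;
        }

        static void waitForStart(TaskStateSource source, Duration timeout) throws InterruptedException {
            Instant deadline = Instant.now().plus(timeout);
            while (Instant.now().isBefore(deadline)) {
                TaskState state = source.currentState();
                if (waitIsOver(state)) {
                    if (state == TaskState.FAILED) {
                        cancelTask(); // hard failure: cancel the task and surface the error to _start
                    }
                    return;           // STARTED: _start returns successfully
                }
                Thread.sleep(100);    // shards not available yet: assignment will be retried, keep polling
            }
            cancelTask();             // timeout reached while shards were still unavailable: treat as a _start failure
            throw new IllegalStateException("timed out waiting for the task to be assigned");
        }

        static void cancelTask() { /* illustrative */ }
    }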
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Mar 10, 2020
…cession quickly (elastic#53192)

A previous change (elastic#53029) is causing analysis jobs to wait for certain indices to be made available. While it is good for jobs to wait, they could fail early on _start.

This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available.

If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled. 

For tasks seeking a new assignment after a node failure, that behavior is unchanged.


closes elastic#53188
benwtrent added a commit that referenced this pull request Mar 10, 2020
…cession quickly (#53192) (#53332)

A previous change (#53029) is causing analysis jobs to wait for certain indices to be made available. While it is good for jobs to wait, they could fail early on _start.

This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available.

If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled. 

For tasks seeking a new assignment after a node failure, that behavior is unchanged.


closes #53188

Labels

:ml Machine learning >refactoring >test Issues or PRs that are addressing/adding tests v7.7.0 v8.0.0-alpha1


Development

Successfully merging this pull request may close these issues.

RunDataFrameAnalyticsIT.testStopOutlierDetectionWithEnoughDocumentsToScroll failures

8 participants