[ML] Fix minor race condition in dataframe analytics _stop #53029
Conversation
Pinging @elastic/ml-core (:ml)
    List<String> unavailableIndices =
-       verifyIndicesPrimaryShardsAreActive(clusterState, resolver, AnomalyDetectorsIndex.configIndexName());
+       verifyIndicesPrimaryShardsAreActive(clusterState,
Should we wait the same way for .ml-state*?
    verifyIndicesPrimaryShardsAreActive(clusterState,
        resolver,
        AnomalyDetectorsIndex.configIndexName(),
        MlStatsIndex.indexPattern());
I feel like I'm missing something obvious here. The .ml-stats index is created by DataFrameAnalyticsManager.execute, called from TaskExecutor.nodeOperation. But TaskExecutor.getAssignment, which runs first, expects the index to exist, and as far as I can tell it hasn't been created yet. Could that be because of the leniency in verifyIndicesPrimaryShardsAreActive?
resolver.concreteIndexNames(clusterState, IndicesOptions.lenientExpandOpen(), indexNames);
Yes, arguably we could prevent assignment if the required indices exist but are not usable at the time an assignment decision is made. But the changes currently in this PR will not fix #53007.
we could prevent assignment if the required indices exist but are not usable at the time an assignment decision is made
A side effect of this would be that the start datafeed API would fail in some cases where if it had waited a fraction of a second it would have succeeded. I am not sure that's a good idea. It certainly deserves a discussion.
For Transforms we get the stats directly from the task if it is running; otherwise the index is searched. Could we not do the same here? In which case .ml-stats would not be touched until the DFA has finished.
I think for anomaly detection we create most of the indices on node start (apart from the results index), which is one way of solving the problem, but it is heavy-handed and still requires checks that the index is active.
There is a scenario y'all are not thinking about.
What if the previously created index is just not available due to the nodes being down?
It's fine being lenient in verifyIndicesPrimaryShardsAreActive if the index does not exist at all, because we will create it.
But, if it already exists and its primary shard is not assigned, that is a bigger issue, one warranting the task not being assigned to a node.
For Transforms we get the stats directly from the task if it is running; otherwise the index is searched. Could we not do the same here?
We could. I can look into that.
Regardless of waiting for the indices to be created before returning from _start, I think the changes in verifyIndicesPrimaryShardsAreActive are necessary and good.
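A minimal standalone sketch of the check being discussed, assuming a simplified cluster-state model; the class, helper, and example index names are illustrative, not the real Elasticsearch code. Missing indices are skipped (the task will create them), while an existing index with an inactive primary is reported so assignment can be withheld.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Standalone sketch: types and names are hypothetical, not the real Elasticsearch classes.
public class PrimaryShardCheckSketch {

    /**
     * Returns the subset of requested indices that exist but whose primary shard
     * is not active. Indices that do not exist yet are skipped (lenient), because
     * the task will create them itself; an existing index with an unassigned
     * primary should block assignment.
     */
    static List<String> unavailableIndices(Map<String, Boolean> primaryActiveByIndex,
                                           Set<String> existingIndices,
                                           List<String> requestedIndices) {
        List<String> unavailable = new ArrayList<>();
        for (String index : requestedIndices) {
            if (existingIndices.contains(index) == false) {
                continue; // missing index: be lenient, it will be created later
            }
            if (Boolean.TRUE.equals(primaryActiveByIndex.get(index)) == false) {
                unavailable.add(index); // exists but primary not assigned: do not assign the task
            }
        }
        return unavailable;
    }

    public static void main(String[] args) {
        List<String> unavailable = unavailableIndices(
            Map.of(".ml-config", true, ".ml-stats-000001", false),
            Set.of(".ml-config", ".ml-stats-000001"),           // .ml-state* not created yet
            List.of(".ml-config", ".ml-stats-000001", ".ml-state-write"));
        System.out.println("Unavailable: " + unavailable);       // prints [.ml-stats-000001]
    }
}
```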
What if the previously created index is just not available due to the nodes being down?
This is what I was referring to in #53029 (comment)
So it solves one problem but creates another:
A side effect of this would be that the start datafeed API would fail in some cases where if it had waited a fraction of a second it would have succeeded. I am not sure that's a good idea. It certainly deserves a discussion.
I think if we do this we should do it consistently for anomaly detection and data frame analytics, so that both or neither assign tasks when the necessary indices exist but are not available.
@dimitris-athanasiou I think you originally added some of these checks. Do you have any extra insights?
A side effect of this would be that the start datafeed API would fail in some cases where if it had waited a fraction of a second it would have succeeded.
It won't fail. It just won't get assigned to a node, which means that it will attempt to get assigned again in a little bit.
This is exactly what we do for anomaly detectors. We verify primary shards are active for the necessary indices, even for ones created by the task. If the index does not exist, we don't bother checking for the primary shard. If the index does exist but the shard is not allocated, then we don't assign the task to a node. The task attempts to assign itself again later.
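As a rough, self-contained illustration of that assignment behaviour (the `Assignment` type and method names below are simplified stand-ins, not the real persistent tasks API): when required indices have inactive primaries, the executor returns an assignment with no node and an explanation, and the task is re-evaluated on later cluster state updates rather than failing.

```java
import java.util.List;

// Standalone sketch of the assignment decision described above; types are
// simplified stand-ins for the persistent tasks framework, not the real classes.
public class AssignmentSketch {

    record Assignment(String nodeId, String explanation) {
        static Assignment awaitingShards(List<String> indices) {
            // No node id: the task stays unassigned and is re-evaluated on the
            // next cluster state change instead of failing outright.
            return new Assignment(null, "primary shards not active for " + indices);
        }
    }

    static Assignment getAssignment(List<String> unavailableIndices, String candidateNode) {
        if (unavailableIndices.isEmpty() == false) {
            return Assignment.awaitingShards(unavailableIndices);
        }
        return new Assignment(candidateNode, "assigned");
    }

    public static void main(String[] args) {
        System.out.println(getAssignment(List.of(".ml-state-write"), "node-1")); // unassigned, with reason
        System.out.println(getAssignment(List.of(), "node-1"));                  // assigned to node-1
    }
}
```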
For Transforms we get the stats directly from the task if it is running; otherwise the index is searched.
We already get the stats for running tasks from in-memory state.
@dimitris-athanasiou I think you originally added some of these checks. Do you have any extra insights?
I think not checking whether the state and stats indices have their primaries active there is an omission. But it is not the cause of this error.
@droberts195, @benwtrent and I discussed this further and decided we will need to add debugging in order to understand what exactly is going on, as it is currently not perfectly clear.
changes in verifyIndicesPrimaryShardsAreActive are necessary and good.
Yes, I agree. This is what we do for AD and we should check the same indices here.
…ame-fix-start-race-condition
@elasticmachine update branch
dimitris-athanasiou
left a comment
LGTM
droberts195
left a comment
LGTM2
run elasticsearch-ci/2
…3029) Tests have been periodically failing due to a race condition when checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already been transitioned to `STARTED`. This allows the `_start` API call to return. But, if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task be removed BEFORE the `.ml-state|stats` indices are created/updated. This change moves towards the task cleaning itself up in its main execution thread. `stop` flips the task's `isStopping` flag, and now we check `isStopping` in every necessary method, allowing the task to stop gracefully. closes elastic#53007
…53164) Tests have been periodically failing due to a race condition when checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already been transitioned to `STARTED`. This allows the `_start` API call to return. But, if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task be removed BEFORE the `.ml-state|stats` indices are created/updated. This change moves towards the task cleaning itself up in its main execution thread. `stop` flips the task's `isStopping` flag, and now we check `isStopping` in every necessary method, allowing the task to stop gracefully. closes #53007
Thanks, Ben!
…cession quickly (#53192) A previous change (#53029) is causing analysis jobs to wait for certain indices to be made available. While it is good for jobs to wait, they could fail early on _start. This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available. If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled. For tasks seeking a new assignment after a node failure, that behavior is unchanged. closes #53188
…cession quickly (#53192) (#53332) A previous change (#53029) is causing analysis jobs to wait for certain indices to be made available. While it is good for jobs to wait, they could fail early on _start. This change will cause the persistent task to continually retry node assignment when the failure is due to shards not being available. If the shards are not available by the time `timeout` is reached by the predicate, it is treated as a _start failure and the task is canceled. For tasks seeking a new assignment after a node failure, that behavior is unchanged. closes #53188
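A small self-contained sketch of the retry-until-timeout behaviour described in that follow-up; the predicate shape and names are hypothetical, not the actual #53192 code. Assignment failures caused by unavailable shards keep the wait going until the timeout, after which _start is failed and the task canceled.

```java
import java.time.Duration;
import java.time.Instant;

// Standalone sketch of the wait-for-assignment predicate described above;
// names are illustrative, not the real Elasticsearch classes.
public class StartWaitPredicateSketch {

    enum WaitResult { KEEP_WAITING, ASSIGNED, FAIL_START }

    static WaitResult evaluate(String assignedNode,
                               boolean failureIsShardUnavailability,
                               Instant startedWaiting,
                               Duration timeout) {
        if (assignedNode != null) {
            return WaitResult.ASSIGNED;
        }
        boolean timedOut = Duration.between(startedWaiting, Instant.now()).compareTo(timeout) >= 0;
        if (failureIsShardUnavailability && timedOut == false) {
            // Shards may simply not be ready yet: keep retrying until the timeout.
            return WaitResult.KEEP_WAITING;
        }
        // Either a non-retryable failure or the timeout was reached:
        // fail _start and cancel the task.
        return WaitResult.FAIL_START;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.now().minusSeconds(5);
        System.out.println(evaluate(null, true, t0, Duration.ofSeconds(30)));      // KEEP_WAITING
        System.out.println(evaluate(null, true, t0, Duration.ofSeconds(1)));       // FAIL_START
        System.out.println(evaluate("node-1", false, t0, Duration.ofSeconds(30))); // ASSIGNED
    }
}
```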
Tests have been periodically failing due to a race condition when checking a recently `STOPPED` task's state. The `.ml-state` index is not created until the task has already been transitioned to `STARTED`. This allows the `_start` API call to return. But, if a user (or test) immediately attempts to `_stop` that job, the job could stop and the task be removed BEFORE the `.ml-state|stats` indices are created/updated.
This change moves towards the task cleaning itself up in its main execution thread. `stop` flips the task's `isStopping` flag, and now we check `isStopping` in every necessary method, allowing the task to stop gracefully.
closes #53007
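A compact sketch of the `isStopping` pattern the description refers to, using hypothetical step names rather than the real DataFrameAnalyticsTask phases: `_stop` only flips a flag, and the task's own execution thread checks it between phases so it can clean itself up gracefully.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Standalone sketch of the isStopping pattern described above; the task and
// step names are illustrative, not the real data frame analytics code.
public class StoppableTaskSketch {

    private final AtomicBoolean isStopping = new AtomicBoolean(false);

    /** Called by the _stop handler: only flips the flag, never removes the task. */
    void stop() {
        isStopping.set(true);
    }

    void run() {
        if (isStopping.get()) { markStopped(); return; }
        reindexSourceData();

        if (isStopping.get()) { markStopped(); return; }
        runAnalytics();

        if (isStopping.get()) { markStopped(); return; }
        persistStateAndStats();

        markStopped();
    }

    // Hypothetical step methods standing in for the real phases of the task.
    private void reindexSourceData() { }
    private void runAnalytics() { }
    private void persistStateAndStats() { }

    /** The task completes and cleans up on its own execution thread. */
    private void markStopped() {
        System.out.println("task stopped cleanly");
    }

    public static void main(String[] args) {
        StoppableTaskSketch task = new StoppableTaskSketch();
        task.stop();  // simulate a _stop arriving immediately after _start
        task.run();   // the task notices isStopping before doing any work
    }
}
```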