-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[ML][Data Frame] fixing _start?force=true bug #45660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML][Data Frame] fixing _start?force=true bug #45660
Conversation
|
Pinging @elastic/ml-core |
| } | ||
|
|
||
| public void setTaskStateStopped() { | ||
| public synchronized void setTaskStateStopped() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this synchronized as start stop and markAsFailed are all synchronized and we do not want task state flipping between states while we are attempting to transition it.
doSaveState calls setTaskStateStopped and since synchronized utilize reentrant locks, stop calling doSaveState directly in certain scenarios should not be a problem.
| // fully initialized. | ||
| // If we are NOT failed, then we can assume that `start` was just called early in the process. | ||
| String msg = taskState.get() == DataFrameTransformTaskState.FAILED ? | ||
| "It failed during the initialization process; force stop to allow reinitialization." : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we failed before even being fully initialized, there was probably something REALLY wrong with the cluster at the time. We should indicate to the user that _stop?force=true needs to be called before attempting to start again.
| synchronized void markAsFailed(String reason, ActionListener<Void> listener) { | ||
| // If we are already flagged as failed, this probably means that a second trigger started firing while we were attempting to | ||
| // flag the previously triggered indexer as failed. Exit early as we are already flagged as failed. | ||
| if (taskState.get() == DataFrameTransformTaskState.FAILED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason to move this up earlier in the stack is to prevent needless checks. Additionally, we don't want to accidentally return weird error messages as a task could potentially (however rarely) be FAILED but the indexer be in STOPPED or STOPPING state.
| getIndexer() == null ? null : getIndexer().getProgress()); | ||
| taskState.set(DataFrameTransformTaskState.FAILED); | ||
| stateReason.set(reason); | ||
| DataFrameTransformState newState = getState(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is safe to call now like this since setTaskStateStopped is synchronized
|
run elasticsearch-ci/2 |
davidkyle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* [ML][Data Frame] fixing _start?force=true bug * removing unused import * removing old TODO
The follow scenario currently blows up in the user's face with confusing errors:
_start?force=trueagainst the transforThis is because we exit the process early and don't allow state/stats to be fully initialized in a failed task so it can be started.
Since PR: #45627 we now check for
FAILEDstate in thestartmethod. This allows us to fully initialize the task and rely onstartto fail due to the previously loaded state from its creation.