[ML][Data frame] fixing failure state transitions and race condition (#45627) #45656
There is a small window for a race condition while we are flagging a task as failed.
Here are the steps where the race condition occurs:
1. Before `AsyncTwoPhaseIndexer` calls the `onFailure` handler, it does the following:
   a. `finishAndSetState()`, which sets the IndexerState to STARTED
   b. `doSaveState(...)`, which attempts to save the current state of the indexer
2. Another trigger fires before `onFailure` can fire, but AFTER `finishAndSetState()` occurs.

The trick here is that we will eventually set the indexer to failed, but possibly not before another trigger has had the opportunity to fire. This could cause some weird state interactions. To combat this, I have put in some predicates that verify the state before taking actions, so that if the state is indeed marked failed, the "second trigger" stops ASAP.
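As a rough illustration of the predicate idea, here is a minimal sketch. It is not the code in this PR: `GuardedTask`, `TaskState`, `processNextBatch()` and the simplified `IndexerState` enum are hypothetical names invented for the example, and the real indexer is asynchronous rather than driven step by step like this.

```java
// Hypothetical sketch: class, enum and method names are illustrative only.
enum TaskState { STARTED, FAILED }

enum IndexerState { STARTED, INDEXING }

class GuardedTask {
    private TaskState taskState = TaskState.STARTED;
    private IndexerState indexerState = IndexerState.STARTED;
    private int batchesProcessed = 0;

    // Flag the task as failed; every later predicate check sees this.
    synchronized void markAsFailed() {
        taskState = TaskState.FAILED;
    }

    // Stand-in for the step that resets the indexer to STARTED
    // before the failure handler gets to run.
    synchronized void finishAndSetState() {
        indexerState = IndexerState.STARTED;
    }

    // Returns true only if this trigger actually started a run.
    synchronized boolean trigger() {
        if (taskState == TaskState.FAILED || indexerState != IndexerState.STARTED) {
            return false;
        }
        indexerState = IndexerState.INDEXING;
        return true;
    }

    // A run re-checks the predicate before each action, so a trigger that
    // slipped in before markAsFailed() still stops at its next step.
    synchronized boolean processNextBatch() {
        if (taskState == TaskState.FAILED || indexerState != IndexerState.INDEXING) {
            return false;
        }
        batchesProcessed++;
        return true;
    }

    synchronized int batchesProcessed() {
        return batchesProcessed;
    }
}
```

Walking through the interleaving above: a trigger that arrives after `finishAndSetState()` but before `markAsFailed()` is still accepted, but its next `processNextBatch()` call returns `false` once the task is flagged, and any later `trigger()` is rejected outright.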
Additionally, I moved the task state checks INTO the `start` and `stop` methods, which now require a `force` parameter. `start`, `stop`, `trigger` and `markAsFailed` are all `synchronized`. This should give us some guarantee that one will not switch states out from underneath another.

I also flag the task as `failed` BEFORE we successfully write it to cluster state, to allow us to make the task fail more quickly. But this does add the behavior where the task is "failed" but the cluster state does not indicate as much. Adding the checks in `start` and `stop` will handle this "real state vs cluster state" race condition. This has always been a problem for `_stop`, as it is not a master node action and doesn’t always have the latest cluster state.

closes #45609
Relates to #45562
Backport of #45627 and #45676
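
For reference, the `start`/`stop` checks described above could look roughly like the sketch below. This is only an illustration under stated assumptions: `ForceAwareTask` and `LifecycleState` are hypothetical names, the real methods take more parameters than shown here, and the exact meaning of `force` (here: allow acting on a task that is already failed) is an assumption rather than something this description spells out.

```java
// Hypothetical sketch: names and the semantics of `force` are assumptions.
enum LifecycleState { STARTED, STOPPED, FAILED }

class ForceAwareTask {
    // In-memory state; may be ahead of what the cluster state says.
    private LifecycleState state = LifecycleState.STOPPED;

    // Flag the task as failed locally, before any cluster state update.
    synchronized void markAsFailed() {
        state = LifecycleState.FAILED;
    }

    // The state check lives inside the synchronized method, so it cannot
    // race with a concurrent markAsFailed() or stop().
    synchronized void start(boolean force) {
        if (state == LifecycleState.FAILED && !force) {
            throw new IllegalStateException("task is failed; use force to start");
        }
        state = LifecycleState.STARTED;
    }

    synchronized void stop(boolean force) {
        if (state == LifecycleState.FAILED && !force) {
            throw new IllegalStateException("task is failed; use force to stop");
        }
        state = LifecycleState.STOPPED;
    }

    synchronized LifecycleState state() {
        return state;
    }
}
```

Because the check reads the task's own state rather than the cluster state, a caller with a stale view of the cluster state still gets rejected (or has to pass `force`) once the task has been flagged as failed.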