[ML][Data frame] fixing failure state transitions and race condition (#45627) #45656

benwtrent · 2019-08-16T14:09:25Z

There is a small window for a race condition while we are flagging a task as failed.

Here are the steps where the race condition occurs:

1. A failure occurs.
2. Before AsyncTwoPhaseIndexer calls the onFailure handler, it does the following:
   a. finishAndSetState(), which sets the IndexerState to STARTED
   b. doSaveState(...), which attempts to save the current state of the indexer
3. Another trigger is fired BEFORE onFailure can fire, but AFTER finishAndSetState() occurs.

The catch is that we will eventually set the indexer to failed, but possibly not before another trigger has had the opportunity to fire. This could cause inconsistent state interactions. To combat this, I have added predicates that verify the state before taking any action, so that if the state is indeed marked failed, the "second trigger" stops ASAP.
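To illustrate the kind of predicate described above, here is a minimal, self-contained sketch. It is not the code from this PR: the class `TriggerGuardSketch`, its enums, and the `runs` counter are hypothetical names used only for illustration, and the real task and indexer classes carry far more state.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not the actual Elasticsearch classes.
class TriggerGuardSketch {
    enum TaskState { STARTED, STOPPED, FAILED }
    enum IndexerState { STARTED, INDEXING, STOPPED }

    private final AtomicReference<TaskState> taskState =
        new AtomicReference<>(TaskState.STARTED);
    private final AtomicReference<IndexerState> indexerState =
        new AtomicReference<>(IndexerState.STARTED);
    private int runs = 0;

    // Returns true only if this trigger actually started an indexing run.
    boolean trigger() {
        // Predicate: a task already marked failed must not start more work.
        if (taskState.get() == TaskState.FAILED) {
            return false;
        }
        // Only one run at a time: STARTED -> INDEXING.
        if (!indexerState.compareAndSet(IndexerState.STARTED, IndexerState.INDEXING)) {
            return false;
        }
        runs++;
        return true;
    }

    // Mirrors step 2a: the indexer returns to STARTED before onFailure runs.
    void finishAndSetState() {
        indexerState.set(IndexerState.STARTED);
    }

    // Stand-in for the failure handler eventually flagging the task.
    void markAsFailed() {
        taskState.set(TaskState.FAILED);
    }

    int runs() {
        return runs;
    }
}
```

In this sketch, a trigger that arrives after `markAsFailed()` is rejected even though the indexer state is back to STARTED. A trigger that slips in between `finishAndSetState()` and `markAsFailed()` would still pass the first check, which is why the state needs to be re-verified before later actions as well.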

Additionally, I moved the task state checks INTO the start and stop methods, which now require a force parameter. start, stop, trigger and markAsFailed are all synchronized. This should give us some guarantee that one will not switch states out from underneath another.
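A rough sketch of that shape, under stated assumptions: the class `SynchronizedTaskSketch` is hypothetical, and the exact semantics of force are assumed here to be "allow the transition even when the task is failed", which this description does not spell out.

```java
// Hypothetical sketch, not the actual task class from this PR.
class SynchronizedTaskSketch {
    enum TaskState { STARTED, STOPPED, FAILED }

    private TaskState state = TaskState.STOPPED;

    // All four methods lock on the same monitor, so none of them can
    // observe a state that another is halfway through changing.
    synchronized void start(boolean force) {
        // Assumption: force bypasses the failed-state check.
        if (state == TaskState.FAILED && !force) {
            throw new IllegalStateException("cannot start a failed task without force");
        }
        state = TaskState.STARTED;
    }

    synchronized void stop(boolean force) {
        if (state == TaskState.FAILED && !force) {
            throw new IllegalStateException("cannot stop a failed task without force");
        }
        state = TaskState.STOPPED;
    }

    // A trigger only does work while the task is started.
    synchronized boolean trigger() {
        return state == TaskState.STARTED;
    }

    synchronized void markAsFailed() {
        state = TaskState.FAILED;
    }

    synchronized TaskState state() {
        return state;
    }
}
```

Because the check and the transition happen inside the same synchronized method, a concurrent `markAsFailed()` cannot land between them.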

I also flag the task as failed BEFORE we successfully write it to cluster state, which lets the task fail more quickly. However, this does introduce a window in which the task is "failed" but the cluster state does not yet indicate as much. Adding the checks in start and stop will handle this "real state vs cluster state" race condition. This has always been a problem for _stop, as it is not a master node action and doesn’t always have the latest cluster state.
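The ordering can be sketched as follows. This is an illustration only: `FailFastSketch` and its fields are hypothetical, and a `CompletableFuture` stands in for the asynchronous cluster state update, which in the real code goes through Elasticsearch's own persistence mechanisms.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch, not the actual task class from this PR.
class FailFastSketch {
    enum TaskState { STARTED, FAILED }

    private TaskState localState = TaskState.STARTED;     // what the task itself believes
    private TaskState persistedState = TaskState.STARTED; // stand-in for cluster state

    // Flip the in-memory state first, then record the persisted state
    // once the (simulated) cluster state write completes.
    synchronized CompletableFuture<Void> markAsFailed(CompletableFuture<Void> clusterStateWrite) {
        localState = TaskState.FAILED;
        return clusterStateWrite.thenRun(() -> {
            synchronized (this) {
                persistedState = TaskState.FAILED;
            }
        });
    }

    // The check consults the local state, so it is correct even while
    // the persisted state is still stale.
    synchronized boolean canStopWithoutForce() {
        return localState != TaskState.FAILED;
    }

    synchronized TaskState localState() {
        return localState;
    }

    synchronized TaskState persistedState() {
        return persistedState;
    }
}
```

Between the call to `markAsFailed` and the completion of the write, the two states disagree. A check that reads only the persisted state would miss the failure during that window, while a check inside the task sees it immediately.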

closes #45609

Relates to #45562

Backport of #45627 and #45676

elasticmachine · 2019-08-16T14:09:27Z

Pinging @elastic/ml-core

benwtrent · 2019-08-16T14:10:27Z

Don't mark this ready to merge for a little while until the initial commit has time to bake for a bit in CI.

Additionally, BWC tests should be muted in master before this is merged, as calling _start in a mixed cluster will fail until master updates its BWC serialization of the request object.



benwtrent added >bug backport :ml/Transform Transform v7.4.0 labels Aug 16, 2019

benwtrent added 2 commits August 19, 2019 08:15

[ML][Data Frame] moves failure state transition for MT safety (elasti…

27405af

…c#45676) * [ML][Data Frame] moves failure state transition for MT safety * removing unused imports

benwtrent force-pushed the backport/7.x/pr-45627 branch from f9df427 to 27405af Compare August 19, 2019 13:31

benwtrent marked this pull request as ready for review August 20, 2019 12:29

benwtrent merged commit 88641a0 into elastic:7.x Aug 20, 2019

benwtrent deleted the backport/7.x/pr-45627 branch August 20, 2019 12:30
