
Conversation

@benwtrent (Member) commented Aug 16, 2019

There is a small window for a race condition while we are flagging a task as failed.

Here are the steps where the race condition occurs:

  1. A failure occurs
  2. Before `AsyncTwoPhaseIndexer` calls the `onFailure` handler it does the following:
    a. `finishAndSetState()`, which sets the `IndexerState` to `STARTED`
    b. `doSaveState(...)`, which attempts to save the current state of the indexer
  3. Another trigger is fired BEFORE `onFailure` can fire, but AFTER `finishAndSetState()` occurs.

The trick here is that we will eventually set the indexer to failed, but possibly not before another trigger has had the opportunity to fire. This can cause some weird state interactions. To combat this, I have added predicates that verify the state before taking action, so that if the state is indeed marked failed, the "second trigger" stops as soon as possible.
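The predicate guard described above can be sketched roughly as follows (a minimal illustration with hypothetical names, not the actual `AsyncTwoPhaseIndexer` code):

```java
// Hypothetical sketch of a state-checking trigger: a trigger that fires in
// the window between finishAndSetState() and the failure being recorded
// becomes a no-op as soon as the FAILED state is visible.
class TriggerGuardSketch {
    enum State { STARTED, INDEXING, FAILED }

    private State state = State.STARTED;

    // Returns true only if this trigger actually started an indexing run.
    synchronized boolean maybeTrigger() {
        if (state == State.FAILED) {
            return false; // failed: the "second trigger" stops ASAP
        }
        if (state != State.STARTED) {
            return false; // an indexing run is already in flight; skip
        }
        state = State.INDEXING;
        return true;
    }

    synchronized void markAsFailed() {
        state = State.FAILED;
    }
}
```

Because both methods are `synchronized`, a trigger can never observe a half-completed failure transition; it sees either the old state or `FAILED`, never something in between.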

Additionally, I moved the task state checks INTO the `start` and `stop` methods, which now require a `force` parameter. `start`, `stop`, `trigger`, and `markAsFailed` are all `synchronized`. This should give us some guarantee that one will not switch states out from underneath another.
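A minimal sketch of that `force`-gated `start`/`stop` shape (illustrative names only, not the real task code):

```java
// Hypothetical sketch: start/stop validate the current state themselves and
// refuse to act on a FAILED task unless force is set. All mutators are
// synchronized so no caller can change state out from underneath another.
class TaskStateSketch {
    enum State { STARTED, STOPPED, FAILED }

    private State state = State.STOPPED;

    synchronized boolean start(boolean force) {
        if (state == State.FAILED && force == false) {
            return false; // refuse: task is failed and force was not given
        }
        state = State.STARTED;
        return true;
    }

    synchronized boolean stop(boolean force) {
        if (state == State.FAILED && force == false) {
            return false; // a failed task can only be stopped with force
        }
        state = State.STOPPED;
        return true;
    }

    synchronized void markAsFailed() {
        state = State.FAILED;
    }

    synchronized State getState() {
        return state;
    }
}
```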

I also flag the task as `failed` BEFORE we successfully write it to cluster state, so that the task fails more quickly. However, this adds the behavior where the task is "failed" while the cluster state does not yet indicate as much. Adding the checks in `start` and `stop` handles this "real state vs. cluster state" race condition. This has always been a problem for `_stop`, as it is not a master node action and doesn't always have the latest cluster state.
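The ordering of that transition can be sketched like this (hypothetical names; `persister` stands in for the asynchronous cluster-state write):

```java
import java.util.function.Consumer;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: flip the in-memory flag to FAILED first so concurrent
// triggers see the failure immediately; the cluster-state write happens
// afterwards, so persisted state can briefly lag the real state.
class FailureTransitionSketch {
    private final AtomicBoolean failed = new AtomicBoolean(false);

    void markAsFailed(String reason, Consumer<String> persister) {
        // 1. The task is failed NOW, before any write succeeds.
        failed.set(true);
        // 2. Persist afterwards; readers of cluster state may still see
        //    the old (non-failed) state until this completes.
        persister.accept(reason);
    }

    boolean isFailed() {
        return failed.get();
    }
}
```

The `start`/`stop` checks above are what close the gap for callers that only see the (possibly stale) persisted state.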

Closes #45609

Relates to #45562

Backport of #45627 and #45676

@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@benwtrent (Member, Author)

Don't mark this ready to merge until the initial commit has had some time to bake in CI.

Additionally, BWC tests should be muted in master before this is merged, as calling `_start` in a mixed cluster will fail until master updates its BWC serialization of the request object.

…c#45676)

* [ML][Data Frame] moves failure state transition for MT safety

* removing unused imports
@benwtrent force-pushed the backport/7.x/pr-45627 branch from f9df427 to 27405af on August 19, 2019 13:31
@benwtrent marked this pull request as ready for review on August 20, 2019 12:29
@benwtrent merged commit 88641a0 into elastic:7.x on August 20, 2019
@benwtrent deleted the backport/7.x/pr-45627 branch on August 20, 2019 12:30