[ML][Data frame] fixing failure state transitions and race condition #45627
Conversation
Pinging @elastic/ml-core
} else {
    // The behavior before V_7_4_0 was that this flag did not exist,
    // assuming previous checks allowed this task to be started.
    force = true;
This is the same behavior as before: previously we only did force checks against the stored cluster state.
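As a rough illustration of that BWC fallback, here is a minimal, self-contained sketch; the class and member names (`StartTaskRequest`, `forceFromWire`, `resolveForce`) are hypothetical stand-ins, not the actual Elasticsearch request classes.

```java
// Minimal sketch of the BWC fallback, with hypothetical names.
final class StartTaskRequest {
    // null when the remote node predates the flag (before V_7_4_0 it did not exist on the wire)
    private final Boolean forceFromWire;

    StartTaskRequest(Boolean forceFromWire) {
        this.forceFromWire = forceFromWire;
    }

    boolean resolveForce() {
        if (forceFromWire != null) {
            return forceFromWire;   // a modern node sent the flag explicitly
        }
        // Old behavior: no flag existed, so assume the earlier checks against the
        // stored cluster state already allowed this task to be started.
        return true;
    }
}
```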
| equalTo("Unable to start data frame transform [test-force-start-failed-transform] as it is in a failed state with failure: [" + | ||
| failureReason + | ||
| "]. Use force start to restart data frame transform once error is resolved.")); | ||
| assertBusy(() -> { |
This `assertBusy` is here because we may still read only the version of the ClusterState where the task state is STARTED, and so get a different error than the one we are expecting. It ensures that we eventually see the ClusterState update and get the failure message we want.
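For readers unfamiliar with the pattern, a self-contained sketch of the retry-until-consistent idea behind `assertBusy` follows; `assertEventually` and its signature are illustrative, not the real test helper.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Illustrative stand-in for the assertBusy pattern: retry an assertion until the
// eventually-consistent cluster state catches up or a timeout expires.
final class EventuallyAssert {
    static void assertEventually(BooleanSupplier condition, Duration timeout) throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (condition.getAsBoolean() == false) {
            if (Instant.now().isAfter(deadline)) {
                throw new AssertionError("condition not met within " + timeout);
            }
            Thread.sleep(100); // back off briefly before re-reading the state
        }
    }
}
```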
| ActionListener<StartDataFrameTransformTaskAction.Response> listener) { | ||
| if (transformTask.getTransformId().equals(request.getId())) { | ||
| transformTask.start(null, listener); | ||
| //TODO fix bug as .start where it was failed could result in a null current checkpoint? |
I noticed this potential bug while working through this; I will investigate it in another PR.
  * @param listener Started listener
  */
-public synchronized void start(Long startingCheckpoint, ActionListener<Response> listener) {
+public synchronized void start(Long startingCheckpoint, boolean force, ActionListener<Response> listener) {
I opted to move the force check (if we end up getting past our earlier checks against the cluster state) INTO these synchronized methods. This gives us some ordering guarantees that make these state transitions easier to reason about.
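A minimal sketch of what moving the check inside the synchronized methods buys, using simplified, hypothetical types rather than the real transform task class:

```java
// Simplified sketch: because start/stop/markAsFailed share one monitor, a caller can
// never observe a half-finished transition, and a failed task only restarts with force.
enum TaskState { STARTED, FAILED, STOPPED }

final class TransformTaskSketch {
    private TaskState state = TaskState.STOPPED;

    synchronized void start(boolean force) {
        if (state == TaskState.FAILED && force == false) {
            throw new IllegalStateException("task is in a failed state; use force to restart it");
        }
        state = TaskState.STARTED;
    }

    synchronized void stop(boolean force) {
        if (state == TaskState.FAILED && force == false) {
            throw new IllegalStateException("task is in a failed state; use force to stop it");
        }
        state = TaskState.STOPPED;
    }

    synchronized void markAsFailed(String reason) {
        state = TaskState.FAILED;
    }
}
```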
// We just don't want it to be failed if it is failed
// Either we are running, and the STATE is already started or failed
// doSaveState should transfer the state to STOPPED when it needs to.
taskState.set(DataFrameTransformTaskState.STARTED);
Since `force` could be true, we should just make the task no longer failed so that the stopping logic can take the correct actions.
@Override
protected void onStart(long now, ActionListener<Boolean> listener) {
    if (transformTask.taskState.get() == DataFrameTransformTaskState.FAILED) {
All of these failure-state checks in the method callbacks cover the situation where another trigger sneaks in after we fail the first one. We want that second trigger to fail ASAP to prevent undesired state interactions.
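To show the shape of those guards, here is a small, self-contained sketch; the listener style is simplified with `Consumer`s instead of the real `ActionListener`, and all names are illustrative.

```java
import java.util.function.Consumer;

// Illustrative sketch: the onStart-style callback re-checks the task's failed flag and
// fails fast, so a trigger that raced in before the failure was fully recorded does no work.
final class OnStartGuardSketch {
    enum TaskState { STARTED, FAILED }

    private volatile TaskState taskState = TaskState.STARTED;

    void onStart(Consumer<Boolean> onResponse, Consumer<Exception> onFailure) {
        if (taskState == TaskState.FAILED) {
            onFailure.accept(new IllegalStateException("task is failed; aborting indexing iteration"));
            return;
        }
        onResponse.accept(true);    // safe to begin this iteration
    }

    void markFailed() {
        taskState = TaskState.FAILED;
    }
}
```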
// If all the remaining tasks are flagged as failed, do not wait for another ClusterState update.
// Return to the caller as soon as possible
return persistentTasksCustomMetaData.tasks().stream().allMatch(p -> exceptions.containsKey(p.getId()));
Not having this check caused periodic timeouts when I was running locally. If the ClusterState is not updated for 30s, this predicate times out because, without the check, it never sees that all the tasks have been flagged as failed.
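A stripped-down sketch of the short-circuit, with plain collection types standing in for the persistent-tasks metadata (names are hypothetical):

```java
import java.util.Collection;
import java.util.Map;

// Sketch: the wait predicate completes as soon as every remaining task id already has a
// recorded failure, rather than waiting up to the timeout for a cluster state update
// that may never reflect it.
final class WaitPredicateSketch {
    static boolean allRemainingTasksFailed(Collection<String> remainingTaskIds, Map<String, Exception> exceptions) {
        return remainingTaskIds.stream().allMatch(exceptions::containsKey);
    }
}
```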
davidkyle left a comment
LGTM. The error messages are much better.
r -> {
    // for auto stop shutdown the task
    if (state.getTaskState().equals(DataFrameTransformTaskState.STOPPED)) {
        onStop();
Do we not need this?? I know all it does is log & audit a message.
@davidkyle `onStop` is called in the async indexer when transitioning from STOPPING -> STOPPED, and we call it directly if stop transitions the indexer directly to STOPPED. This resulted in there always being two log entries for stopping, which confused me at first (I thought somebody had called stop twice).
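A rough sketch of the two paths that both end in `onStop`, with heavily simplified state handling rather than the real indexer (all names illustrative):

```java
// Illustrative only: stop() either transitions straight to STOPPED and notifies here,
// or asks a running indexer to stop, in which case the indexer calls onStop() itself
// after its STOPPING -> STOPPED transition. Hence two "stopped" log lines were possible.
enum IndexerState { STARTED, STOPPING, STOPPED }

final class StopPathsSketch {
    private IndexerState indexerState = IndexerState.STARTED;
    private boolean iterationInFlight = false;

    void stop() {
        if (iterationInFlight == false) {
            indexerState = IndexerState.STOPPED;
            onStop();                             // direct transition: notify immediately
        } else {
            indexerState = IndexerState.STOPPING; // the running iteration finishes, then the
                                                  // indexer moves to STOPPED and calls onStop()
        }
    }

    void onStop() {
        // log + audit "data frame transform has stopped"
    }
}
```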
if (getIndexer() != null && getIndexer().getState() == IndexerState.STOPPING) {
    logger.info("Attempt to fail transform [" + getTransformId() + "] with reason [" + reason + "] while it was stopping.");
    auditor.info(getTransformId(), "Attempted to fail transform with reason [" + reason + "] while in STOPPING state.");
    listener.onResponse(null);
Oops. Good spot.
run elasticsearch-ci/1
…/dataframe/action/TransportStopDataFrameTransformAction.java Co-Authored-By: David Kyle <[email protected]>
Waiting to open the backport of this until after this has had a chance to bump around in CI.
Merged: [ML][Data frame] fixing failure state transitions and race condition (elastic#45627). Closes elastic#45609. Relates to elastic#45562.
Backported in #45656: [ML][Data frame] fixing failure state transitions and race condition (#45627) and [ML][Data Frame] moves failure state transition for MT safety (#45676); also removes unused imports.
There is a small window for a race condition while we are flagging a task as failed.

Here are the steps where the race condition occurs:

1. A failure occurs.
2. Before `AsyncTwoPhaseIndexer` calls the `onFailure` handler it does the following:
   a. `finishAndSetState()`, which sets the IndexerState to STARTED
   b. `doSaveState(...)`, which attempts to save the current state of the indexer
3. Another trigger is fired BEFORE `onFailure` can fire, but AFTER `finishAndSetState()` occurs.

The trick here is that we will eventually set the indexer to failed, but possibly not before another trigger has had the opportunity to fire. This could obviously cause some weird state interactions. To combat this, I have put in some predicates to verify the state before taking actions, so that if the state is indeed marked failed, the "second trigger" stops ASAP (see the sketch below).
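The sketch referenced above, a minimal self-contained illustration with hypothetical names rather than the real `AsyncTwoPhaseIndexer` or task classes, shows how the in-memory flag closes that window:

```java
// Simplified sketch of the window this PR closes. markAsFailed() flips an in-memory flag
// under the same monitor used by trigger(), so a trigger firing after finishAndSetState()
// but before onFailure() completes is rejected instead of starting another iteration.
final class FailureRaceSketch {
    private boolean failed = false;     // in-memory "real" state, set before any persistence

    synchronized boolean trigger() {
        if (failed) {
            return false;               // the racing "second trigger" bails out here
        }
        // ... start the next indexing iteration ...
        return true;
    }

    synchronized void markAsFailed(String reason) {
        failed = true;                  // flag the task as failed immediately ...
        // ... then persist the failed state (e.g. to cluster state) asynchronously
    }
}
```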
Additionally, I move the task state checks INTO the `start` and `stop` methods, which now require a `force` parameter. `start`, `stop`, `trigger`, and `markAsFailed` are all `synchronized`. This should give us some guarantees that one will not switch states out from underneath another.

I also flag the task as `failed` BEFORE we successfully write it to cluster state; this allows the task to fail more quickly. But it does add the behavior where the task is "failed" while the cluster state does not indicate as much. Adding the checks in `start` and `stop` handles this "real state vs cluster state" race condition. This has always been a problem for `_stop`, as it is not a master node action and doesn't always have the latest cluster state.

closes #45609
Relates to #45562