[ML][Data Frame] add support for wait_for_checkpoint flag on _stop API
#45469
Conversation
Pinging @elastic/ml-core
listener.onResponse(new StopDataFrameTransformAction.Response(Boolean.TRUE));
// To cover strange state race conditions, we adjust the variable first (which writes to cluster state if it is different)
// then we stop the transform
transformTask.setShouldStopAtCheckpoint(request.isWaitForCheckpoint(), ActionListener.wrap(
Since the default value for wait_for_checkpoint is true and folks could call this against _all transforms, this could get pretty expensive as there will be a cluster state update for each individual task.
Do we want to reconsider the default value of true or do we think this behavior is acceptable?
We already do a cluster state update for each transform when _all are stopped, because removing each persistent task is a cluster state update.
With this change, if there are 50 started transforms then we'll go from doing 50 cluster state updates when stopping _all to 100: 50 to set the flags and 50 later on to remove the tasks. So it's not like we're going from being completely independent of cluster state to suddenly creating a large cluster state update load.
One thing we could consider, if we think most people will stick with the default of wait_for_checkpoint: true, is to default this variable to true when the first DataFrameTransformState is created, i.e. when the persistent task is first created. Then it won't need updating in the common case, only if people have overridden wait_for_checkpoint: false.
The difficulty with having it default to true within the task is that we would then need two booleans:
- one that indicates that when we stop, we should wait for the checkpoint
- one that indicates that we are stopping
Because checkpoints are completed when onFinish is called, we need a way to determine that we are "stopping" and the task should complete. In this PR, we are relying on the flag being set to indicate that the task should stop when onFinish is completed.
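To make the trade-off concrete, here is a minimal plain-Java sketch (illustrative only, not the actual DataFrameTransformTask code): with the flag defaulting to false, one boolean can serve both as "wait for the checkpoint" and as "a stop has been requested", which is what onFinish relies on in this PR.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified model, not the real indexer: the flag is false by default and is only
// set when _stop?wait_for_checkpoint=true arrives, so "flag is set" also means
// "a stop was requested".
class IndexerSketch {
    private final AtomicBoolean shouldStopAtCheckpoint = new AtomicBoolean(false);

    // Hypothetical alternative if the flag defaulted to true; a second boolean
    // would then be needed to know that a stop was actually requested:
    // private final AtomicBoolean waitForCheckpointOnStop = new AtomicBoolean(true);
    // private final AtomicBoolean stopRequested = new AtomicBoolean(false);

    void onFinish() {
        // checkpoints complete here; the flag tells us the task should now stop
        if (shouldStopAtCheckpoint.get()) {
            stop();
        }
    }

    void stop() {
        // transition the indexer towards STOPPED
    }
}
```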
getTransformId()));
return;
}
if (shouldStopAtCheckpoint) {
We don't want start to be called if shouldStopAtCheckpoint is set.
Honestly, calling start here is pretty harmless, but I think it is better to throw to alert the user that "hey, we are stopping this thing soon".
This covers the case where the DF is stopping with wait_for_checkpoint=true and then someone tries to start it again via the API?
Yes @davidkyle, as I said, calling start here should not cause problems, but if the user is attempting to start it after somebody already called _stop?wait_for_checkpoint=true, I think we should throw an error indicating that we are stopping.
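For illustration, a simplified sketch of the guard under discussion (class name, id, and message text are hypothetical, not the exact production code): a start request is rejected while a checkpoint-bounded stop is pending, so the caller learns the transform is about to stop.

```java
// Minimal sketch of rejecting start() while a stop-at-checkpoint is pending.
class StartGuardSketch {
    private volatile boolean shouldStopAtCheckpoint;
    private final String transformId = "my-transform"; // hypothetical id

    synchronized void start() {
        if (shouldStopAtCheckpoint) {
            throw new IllegalStateException("cannot start transform ["
                + transformId + "] as it is currently stopping at the next checkpoint");
        }
        // ... continue with the normal start path
    }
}
```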
}

if (getIndexer().getState() == IndexerState.STOPPED) {
if (getIndexer().getState() == IndexerState.STOPPED || getIndexer().getState() == IndexerState.STOPPING) {
This should have always been here; we should not allow the execution path to continue if we are already STOPPING.
logger.debug("Data frame indexer [{}] schedule has triggered, state: [{}]", event.getJobName(), getIndexer().getState());
// If we are failed, don't trigger
if (taskState.get() == DataFrameTransformTaskState.FAILED) {
Having these checks earlier (they are done lower down in the indexer as well) allows for simpler reasoning about what is done within these synchronized methods of stop, start, and triggered.
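A small illustrative sketch of the pattern (simplified enum and field names, not the real indexer classes): the early guards let the synchronized methods return before doing any real work in invalid states.

```java
// Simplified model showing early-return guards in a synchronized trigger method.
class TriggerSketch {
    enum TaskState { STARTED, FAILED }
    enum IndexerState { INDEXING, STARTED, STOPPING, STOPPED }

    private volatile TaskState taskState = TaskState.STARTED;
    private volatile IndexerState indexerState = IndexerState.STARTED;

    synchronized void triggered() {
        if (taskState == TaskState.FAILED) {
            return; // never trigger a failed task
        }
        if (indexerState == IndexerState.STOPPED || indexerState == IndexerState.STOPPING) {
            return; // nothing to do if we are already stopped or on the way there
        }
        // ... safe to kick off the next iteration here
    }
}
```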
nextCheckpoint = null;
// Reset our failure count as we have finished and may start again with a new checkpoint
failureCount.set(0);
transformTask.stateReason.set(null);
If we are finishing, the state reason should be cleared out regardless.
run elasticsearch-ci/bwc

@elasticmachine update branch
droberts195 left a comment
I think there are quite a few tricky issues to think about with this. I left a few comments. Maybe we also need to discuss it further as a group on the weekly call.
return shouldStopAtCheckpoint;
}

public void setShouldStopAtCheckoint(boolean shouldStopAtCheckpoint) {
Objects that are stored in the cluster state are supposed to be immutable. This class was already breaching that rule for node. It doesn't cause a problem given the way it's used, because DataFrameTransformState objects are deconstructed in the constructor of DataFrameTransformTask and newly constructed in DataFrameTransformTask.getState(), so the only long-term DataFrameTransformState objects are in the cluster state itself. But I do wonder whether we should make this more idiomatic now, as it seems like a problem waiting to happen when a future maintainer doesn't realise all the subtleties that are going on here.

Instead, DataFrameTransformState could have a builder that can be initialized with an existing object, have one field changed, and then build a new object. Alternatively it could have a copy constructor that allows everything to be copied except node or shouldStopAtCheckpoint, although now that there are two fields that might need to be overridden a builder is probably better. Alternatively, since DataFrameTransformTask.getState() constructs a new object, that could have an overload that lets you specify the bits you want to be different from the current state.
The reason it's dangerous to have a mutable object in the cluster state is this:
1. You have a reference to an object of a type that's stored in the cluster state.
2. You update that object.
3. You know that it needs to be updated in the cluster state of all nodes, so you pass the updated object to the cluster state update API to do that.
4. The cluster state update API receives your change request, checks to see if the local cluster state has changed, and only if so broadcasts the change to all nodes.

The problem arises if the reference in step 1 referred to the actual object in the local cluster state. If it did, then the check for changes in step 4 won't find any changes, because when you updated your object that was of a type that's stored in the cluster state it actually did update the local cluster state. This then leads to the cluster state of the current node being different to the cluster state of all the other nodes, and you'll never find out from testing in a single-node cluster.
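As a concrete illustration of the builder approach, here is a minimal sketch using a cut-down, hypothetical state class rather than the real DataFrameTransformState: every "mutation" yields a brand-new object, so a reference into the local cluster state can never be modified in place.

```java
// Simplified, immutable state object with a builder seeded from an existing instance.
final class TransformStateSketch {
    private final String node;
    private final boolean shouldStopAtCheckpoint;

    private TransformStateSketch(Builder builder) {
        this.node = builder.node;
        this.shouldStopAtCheckpoint = builder.shouldStopAtCheckpoint;
    }

    static Builder builder(TransformStateSketch existing) {
        Builder b = new Builder();
        b.node = existing.node;
        b.shouldStopAtCheckpoint = existing.shouldStopAtCheckpoint;
        return b;
    }

    static final class Builder {
        private String node;
        private boolean shouldStopAtCheckpoint;

        Builder setShouldStopAtCheckpoint(boolean value) {
            this.shouldStopAtCheckpoint = value;
            return this;
        }

        TransformStateSketch build() {
            return new TransformStateSketch(this);
        }
    }
}
```

Usage would then be to copy, tweak one field, and submit the new object in the cluster state update, e.g. `TransformStateSketch updated = TransformStateSketch.builder(current).setShouldStopAtCheckpoint(true).build();`.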
this.reason = reason;
this.progress = progress;
this.node = node;
this.shouldStopAtCheckpoint = shouldStopAtCheckpoint;
This is not included in the X-Content representation. Is there a good reason for that?
I guess it's because we don't want this to go into the version of this object that gets persisted to the index as part of a DataFrameTransformStoredDoc? But omitting it from the X-Content representation also means it won't survive in cluster state if there's a full cluster restart.
There are other comments saying that in 8.x we want to replace this with a STOPPING enum value. But that would be persisted both in the DataFrameTransformStoredDoc and in the on-disk version of the cluster state. So there's an inconsistency here.
@droberts195
If the cluster state stores itself as XContent, how does it know how to deserialize the objects?
Also, if we are going to store this in the index, we may just want to bite the bullet and figure out how to ONLY store it in the index.
davidkyle left a comment
This is tricky code to reason about; I will probably need another review. Looks ok, left a few comments.
Regarding the default wait_for_checkpoint: true, it makes sense as the default behaviour, but it does significantly change how stop works for continuous DFs. The change is for the best, so I'm +1 to it.
bulk.add(new IndexRequest().source(sourceBuilder.toString(), XContentType.JSON));

if (i % 50 == 0) {
if (i % 100 == 0) {
👍 this could probably be 1000
assertTrue(startDataFrameTransform(config.getId(), RequestOptions.DEFAULT).isAcknowledged());

// waitForCheckpoint: true should make the transform continue until we hit the first checkpoint, then it will stop
stopDataFrameTransform(transformId, false, null, true);
What happens if the checkpoint is hit before this stop request lands? And what happens if we are using a continuous DF which is at a checkpoint but there is no new data (getIndexer().sourceHasChanged() == false)? How does the DF get from STOPPING to STOPPED?
@davidkyle you are 100% correct, there is a bad race condition here that could cause the transform to just stay running after the user called _stop.
- onFinish is called and has made it past the if (transformTask.shouldStopAtCheckpoint) { check
- The flag is set to true by the user
- stop is called and processed, but since the state is INDEXING nothing is done
- onFinish completes and sets the state to STARTED
The trick here is that the indexer transitions to STARTED and can get triggered again even off of failures. I think this also shows a bug in how we handle triggers to begin with. If we have not completed a checkpoint, I am not sure we should even check for changes against the indices more than once per checkpoint...
Let me mull this over
I think something will have to be added to doSaveState to handle this race condition.
When doSaveState is called after onFinish the indexer state is then STARTED. If the indexer state is STARTED and shouldStopAtCheckpoint == true that should give some indication of the desired behavior. Though, this may cause other issues as doSaveState is called when we hit an intermittent failure :(.
More careful thought is necessary for this one.
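A rough sketch of the mitigation being floated (simplified and hypothetical, not the final fix): treat "indexer is STARTED and shouldStopAtCheckpoint is true" inside doSaveState as the signal that a stop arrived mid-checkpoint and (re)issue the stop there.

```java
// Simplified model: doSaveState notices that a checkpoint finished before the
// stop request was processed, and stops instead of letting the task keep running.
class SaveStateSketch {
    enum IndexerState { STARTED, INDEXING, STOPPING, STOPPED }

    private volatile boolean shouldStopAtCheckpoint;

    void doSaveState(IndexerState indexerState, Runnable stop) {
        if (indexerState == IndexerState.STARTED && shouldStopAtCheckpoint) {
            stop.run();
            return;
        }
        // ... persist state as usual
    }
}
```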
…om:benwtrent/elasticsearch into feature/ml-df-add-wait_for_checkpoint-flag
if (taskState.get() == DataFrameTransformTaskState.FAILED) {
logger.debug("[{}] schedule was triggered for transform but task is failed. Ignoring trigger.", getTransformId());
if (taskState.get() == DataFrameTransformTaskState.FAILED || taskState.get() == DataFrameTransformTaskState.STOPPED) {
Having a trigger occur when taskState.get() == DataFrameTransformTaskState.STOPPED is not really possible; we don't ever transition this state to STOPPED in the task any longer. I put this check in as insurance, as we should not trigger on a stopped task.
// Since we save the state to an index, we should make sure that our task state is in parity with the indexer state
if (indexerState.equals(IndexerState.STOPPED)) {
    transformTask.setTaskStateStopped();
}
I removed this transition to STOPPED for the task state as there was no need, and it opened a window for start to be called again while we were in the middle of completing doSaveState while stopping.
This work is essentially blocked by #46156. Much of the boilerplate will stay the same, but we will be handling state differently. If we have optimistic concurrency protection for …
This adds a new flag wait_for_checkpoint for the _stop transform API.

The behavior is as follows:
- … force=true.

Implementation details:
- … _stop right when a doSaveState action is being called, as the persisted state MAY be overwritten.
- … STOPPING state since the enumerations are stored in the index, and attempting to read in a new value via the XContentParser on an old node would blow up.

closes #45293