[ML][Data Frame] add support for wait_for_checkpoint flag on _stop API
#45469
Conversation
Pinging @elastic/ml-core
listener.onResponse(new StopDataFrameTransformAction.Response(Boolean.TRUE));
// To cover strange state race conditions, we adjust the variable first (which writes to cluster state if it is different)
// then we stop the transform
transformTask.setShouldStopAtCheckpoint(request.isWaitForCheckpoint(), ActionListener.wrap(
Since the default value for wait_for_checkpoint is true and folks could call this against _all transforms, this could get pretty expensive as there will be a cluster state update for each individual task.
Do we want to reconsider the default value of true or do we think this behavior is acceptable?
We already do a cluster state update for each transform when _all are stopped, because removing each persistent task is a cluster state update.
With this change, if there are 50 started transforms then we'll go from doing 50 cluster state updates when stopping _all to 100: 50 to set the flags and 50 later on to remove the tasks. So it's not like we're going from being completely independent of cluster state to suddenly creating a large cluster state update load.
One thing we could consider, if we think most people will stick with the default of wait_for_checkpoint: true, is to default this variable to true when the first DataFrameTransformState is created, i.e. when the persistent task is first created. Then it won't need updating in the common case, only if people have overridden wait_for_checkpoint: false.
The difficulty with having it default to true within the task is that we would then need two booleans:
- one that indicates that when we stop, we should wait for the checkpoint
- one that indicates that we are stopping
Because checkpoints are completed when onFinish is called, we need a way to determine that we are "stopping" and the task should complete. In this PR, we are relying on the flag being set to indicate that the task should stop when onFinish is completed.
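To make the trade-off concrete, here is a minimal plain-Java sketch (illustrative only, not the actual DataFrameTransformTask code): with the flag defaulting to false, one boolean can serve both as "wait for the checkpoint" and as "a stop has been requested", which is what onFinish relies on in this PR.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified model, not the real indexer: the flag is false by default and is only
// set when _stop?wait_for_checkpoint=true arrives, so "flag is set" also means
// "a stop was requested".
class IndexerSketch {
    private final AtomicBoolean shouldStopAtCheckpoint = new AtomicBoolean(false);

    // Hypothetical alternative if the flag defaulted to true; a second boolean
    // would then be needed to know that a stop was actually requested:
    // private final AtomicBoolean waitForCheckpointOnStop = new AtomicBoolean(true);
    // private final AtomicBoolean stopRequested = new AtomicBoolean(false);

    void onFinish() {
        // checkpoints complete here; the flag tells us the task should now stop
        if (shouldStopAtCheckpoint.get()) {
            stop();
        }
    }

    void stop() {
        // transition the indexer towards STOPPED
    }
}
```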
getTransformId()));
return;
}
if (shouldStopAtCheckpoint) {
We don't want start to be called if shouldStopAtCheckpoint is set.
Honestly, calling start here is pretty harmless, but I think it is better to throw to alert the user that "hey, we are stopping this thing soon".
This covers the case where the DF is stopping with wait_for_checkpoint=true and then someone tries to start it again via the API?
Yes @davidkyle, as I said, calling start here should not cause problems, but if the user is attempting to start it after somebody already called _stop?wait_for_checkpoint=true, I think we should throw an error indicating that we are stopping.
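For illustration, a simplified sketch of the guard under discussion (class name, id, and message text are hypothetical, not the exact production code): a start request is rejected while a checkpoint-bounded stop is pending, so the caller learns the transform is about to stop.

```java
// Minimal sketch of rejecting start() while a stop-at-checkpoint is pending.
class StartGuardSketch {
    private volatile boolean shouldStopAtCheckpoint;
    private final String transformId = "my-transform"; // hypothetical id

    synchronized void start() {
        if (shouldStopAtCheckpoint) {
            throw new IllegalStateException("cannot start transform ["
                + transformId + "] as it is currently stopping at the next checkpoint");
        }
        // ... continue with the normal start path
    }
}
```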
}

if (getIndexer().getState() == IndexerState.STOPPED) {
if (getIndexer().getState() == IndexerState.STOPPED || getIndexer().getState() == IndexerState.STOPPING) {
This should have always been here; we should not allow the execution path to continue if we are already STOPPING.
logger.debug("Data frame indexer [{}] schedule has triggered, state: [{}]", event.getJobName(), getIndexer().getState());
// If we are failed, don't trigger
if (taskState.get() == DataFrameTransformTaskState.FAILED) {
Having these checks earlier (they are done lower down in the indexer as well) allows for simpler reasoning about what is done within these synchronized methods of stop, start, and triggered.
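A small illustrative sketch of the pattern (simplified enum and field names, not the real indexer classes): the early guards let the synchronized methods return before doing any real work in invalid states.

```java
// Simplified model showing early-return guards in a synchronized trigger method.
class TriggerSketch {
    enum TaskState { STARTED, FAILED }
    enum IndexerState { INDEXING, STARTED, STOPPING, STOPPED }

    private volatile TaskState taskState = TaskState.STARTED;
    private volatile IndexerState indexerState = IndexerState.STARTED;

    synchronized void triggered() {
        if (taskState == TaskState.FAILED) {
            return; // never trigger a failed task
        }
        if (indexerState == IndexerState.STOPPED || indexerState == IndexerState.STOPPING) {
            return; // nothing to do if we are already stopped or on the way there
        }
        // ... safe to kick off the next iteration here
    }
}
```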
nextCheckpoint = null;
// Reset our failure count as we have finished and may start again with a new checkpoint
failureCount.set(0);
transformTask.stateReason.set(null);
If we are finishing, the state reason should be cleared out regardless.
run elasticsearch-ci/bwc

@elasticmachine update branch
droberts195 left a comment
I think there are quite a few tricky issues to think about with this. I left a few comments. Maybe we also need to discuss it further as a group on the weekly call.
return shouldStopAtCheckpoint;
}

public void setShouldStopAtCheckoint(boolean shouldStopAtCheckpoint) {
Objects that are stored in the cluster state are supposed to be immutable. This class was already breaching that rule for node. It doesn't cause a problem given the way it's used, because DataFrameTransformState objects are deconstructed in the constructor of DataFrameTransformTask and newly constructed in DataFrameTransformTask.getState(), so the only long-term DataFrameTransformState objects are in the cluster state itself. But I do wonder whether we should make this more idiomatic now, as it seems like a problem waiting to happen when a future maintainer doesn't realise all the subtleties that are going on here.

Instead, DataFrameTransformState could have a builder that can be initialized with an existing object, have one field changed, and then build a new object. Alternatively it could have a copy constructor that allows everything to be copied except node or shouldStopAtCheckpoint, although now that there are two fields that might need to be overridden a builder is probably better. Alternatively, since DataFrameTransformTask.getState() constructs a new object, that could have an overload that lets you specify the bits you want to be different from the current state.
The reason it's dangerous to have a mutable object in the cluster state is this:
1. You have a reference to an object of a type that's stored in the cluster state.
2. You update that object.
3. You know that it needs to be updated in the cluster state of all nodes, so you pass the updated object to the cluster state update API to do that.
4. The cluster state update API receives your change request, checks to see if the local cluster state has changed, and only if so broadcasts the change to all nodes.

The problem arises if the reference in step 1 referred to the actual object in the local cluster state. If it did, then the check for changes in step 4 won't find any changes, because when you updated your object that was of a type that's stored in the cluster state it actually did update the local cluster state. This then leads to the cluster state of the current node being different to the cluster state of all the other nodes, and you'll never find out from testing in a single-node cluster.
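As a concrete illustration of the builder approach, here is a minimal sketch using a cut-down, hypothetical state class rather than the real DataFrameTransformState: every "mutation" yields a brand-new object, so a reference into the local cluster state can never be modified in place.

```java
// Simplified, immutable state object with a builder seeded from an existing instance.
final class TransformStateSketch {
    private final String node;
    private final boolean shouldStopAtCheckpoint;

    private TransformStateSketch(Builder builder) {
        this.node = builder.node;
        this.shouldStopAtCheckpoint = builder.shouldStopAtCheckpoint;
    }

    static Builder builder(TransformStateSketch existing) {
        Builder b = new Builder();
        b.node = existing.node;
        b.shouldStopAtCheckpoint = existing.shouldStopAtCheckpoint;
        return b;
    }

    static final class Builder {
        private String node;
        private boolean shouldStopAtCheckpoint;

        Builder setShouldStopAtCheckpoint(boolean value) {
            this.shouldStopAtCheckpoint = value;
            return this;
        }

        TransformStateSketch build() {
            return new TransformStateSketch(this);
        }
    }
}
```

Usage would then be to copy, tweak one field, and submit the new object in the cluster state update, e.g. `TransformStateSketch updated = TransformStateSketch.builder(current).setShouldStopAtCheckpoint(true).build();`.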
this.reason = reason;
this.progress = progress;
this.node = node;
this.shouldStopAtCheckpoint = shouldStopAtCheckpoint;
This is not included in the X-Content representation. Is there a good reason for that?
I guess it's because we don't want this to go into the version of this object that gets persisted to the index as part of a DataFrameTransformStoredDoc? But omitting it from the X-Content representation also means it won't survive in cluster state if there's a full cluster restart.
There are other comments saying that in 8.x we want to replace this with a STOPPING enum value. But that would be persisted both in the DataFrameTransformStoredDoc and in the on-disk version of the cluster state. So there's an inconsistency here.
@droberts195
If the cluster state stores itself as XContent, how does it know how to deserialize the objects?
Also, if we are going to store this in the index, we may just want to bite the bullet and figure out how to ONLY store it in the index.
davidkyle left a comment
This is tricky code to reason about; I will probably need another review. Looks ok, left a few comments.
Regarding the default wait_for_checkpoint: true, it makes sense as the default behaviour, but it does significantly change how stop works for continuous DFs. The change is for the best, so I'm +1 to it.
bulk.add(new IndexRequest().source(sourceBuilder.toString(), XContentType.JSON));

if (i % 50 == 0) {
if (i % 100 == 0) {
👍 this could probably be 1000
assertTrue(startDataFrameTransform(config.getId(), RequestOptions.DEFAULT).isAcknowledged());

// waitForCheckpoint: true should make the transform continue until we hit the first checkpoint, then it will stop
stopDataFrameTransform(transformId, false, null, true);
What happens if the checkpoint is hit before this stop request lands? And what happens if we are using a continuous DF which is at a checkpoint but there is no new data (getIndexer().sourceHasChanged() == false)? How does the DF get from STOPPING to STOPPED?
@davidkyle you are 100% correct, there is a bad race condition here that could cause the transform to just stay running after the user called _stop.
- onFinish is called and has made it past the if (transformTask.shouldStopAtCheckpoint) { check
- The flag is set to true by the user
- stop is called and processed, but since the state is INDEXING nothing is done
- onFinish completes and sets the state to STARTED
The trick here is that the indexer transitions to STARTED and can get triggered again even off of failures. I think this also shows a bug in how we handle triggers to begin with. If we have not completed a checkpoint, I am not sure we should even check for changes against the indices more than once per checkpoint...
Let me mull this over
I think something will have to be added to doSaveState to handle this race condition.
When doSaveState is called after onFinish the indexer state is then STARTED. If the indexer state is STARTED and shouldStopAtCheckpoint == true that should give some indication of the desired behavior. Though, this may cause other issues as doSaveState is called when we hit an intermittent failure :(.
More careful thought is necessary for this one.
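A rough sketch of the mitigation being floated (simplified and hypothetical, not the final fix): treat "indexer is STARTED and shouldStopAtCheckpoint is true" inside doSaveState as the signal that a stop arrived mid-checkpoint and (re)issue the stop there.

```java
// Simplified model: doSaveState notices that a checkpoint finished before the
// stop request was processed, and stops instead of letting the task keep running.
class SaveStateSketch {
    enum IndexerState { STARTED, INDEXING, STOPPING, STOPPED }

    private volatile boolean shouldStopAtCheckpoint;

    void doSaveState(IndexerState indexerState, Runnable stop) {
        if (indexerState == IndexerState.STARTED && shouldStopAtCheckpoint) {
            stop.run();
            return;
        }
        // ... persist state as usual
    }
}
```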
…om:benwtrent/elasticsearch into feature/ml-df-add-wait_for_checkpoint-flag
if (taskState.get() == DataFrameTransformTaskState.FAILED) {
logger.debug("[{}] schedule was triggered for transform but task is failed. Ignoring trigger.", getTransformId());
if (taskState.get() == DataFrameTransformTaskState.FAILED || taskState.get() == DataFrameTransformTaskState.STOPPED) {
Having a trigger occur when taskState.get() == DataFrameTransformTaskState.STOPPED is not really possible; we don't ever transition this state to STOPPED in the task any longer. I put this check in as insurance, as we should not trigger on a stopped task.
// Since we save the state to an index, we should make sure that our task state is in parity with the indexer state
if (indexerState.equals(IndexerState.STOPPED)) {
    transformTask.setTaskStateStopped();
}
I removed this transition to STOPPED for the task state as there was no need, and it opened a window for start to be called again while we were in the middle of completing doSaveState while stopping.
This work is essentially blocked by #46156. Much of the boilerplate will stay the same, but we will be handling state differently. If we have optimistic concurrency protection for …
This adds a new flag wait_for_checkpoint for the _stop transform API.

The behavior is as follows:
- … force=true.

Implementation details:
- … _stop right when a doSaveState action is being called, as the persisted state MAY be overwritten.
- … STOPPING state since the enumerations are stored in the index, and attempting to read in a new value via the XContentParser on an old node would blow up.

closes #45293