-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Change step execution flow to be deliberate about type #34126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change step execution flow to be deliberate about type #34126
Conversation
This commit changes the way that step execution flows. Rather than have any step run when the cluster state changes or the periodic scheduler fires, this now runs the different types of steps at different times. `AsyncWaitStep` is run at a periodic manner, ie, every 10 minutes by default `ClusterStateActionStep` and `ClusterStateWaitStep` are run every time the cluster state changes. `AsyncActionStep` is now run only after the cluster state has been transitioned into a new step. This prevents these non-idempotent steps from running at the same time. It addition to being run when transitioned into, this is also run when a node is newly elected master (only if set as the current step) so that master failover does not fail to run the step. This also changes the `RolloverStep` from an `AsyncActionStep` to an `AsyncWaitStep` so that it can run periodically. Relates to elastic#29823
|
Pinging @elastic/es-core-infra |
colings86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dakrone I left some comments
| Objects.equals(maxDocs, other.maxDocs); | ||
| } | ||
|
|
||
| // TODO: expand the information we provide? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is any information from the current RolloverResponse that would be helpful here. The closest is the conditionStatus but since this is a Map<String, Boolean> it doesn't provide anything useful when the index is not rolled over because all the conditions will be false. Maybe we should open an issue for discussion suggesting to add more information to the conditionStatus in the RolloverResponse so it exposes the age, number of documents, and store size it found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll open a separate issue for that, good idea.
| index.getName(), currentStep.getClass().getSimpleName(), currentStep.getKey(), currentStep.getNextStepKey()); | ||
| currentState = ((ClusterStateActionStep) currentStep).performAction(index, currentState); | ||
| if (currentStep.getNextStepKey() == null) { | ||
| return currentState; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not need to set nextStepKey to null here so in the clusterStateProcessed method we don't end up doing the wrong thing? I think for now it would actually work because the nextStepKey would never be an AsyncAction in this scenario but probably worth correcting this so bugs don't creep in the future becuase the nextStepKey is still pointing at the previous step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nextStepKey starts out by being null, so this will already do the right thing. I'll add an explicit initialization to null in the method though, so it's more apparent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this task potentially loops through multiple steps right? So if you have a cluster state step , another cluster state step and then null next step will get step on the first step and then remain set here when the second step's next step is null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good point, I'll fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think the behavior might be okay,
- CSAS (cluster state action step) executes with nextStep
A, moves cluster state to next step (which is also a CSAS) - CSAS executes with nextStep
null, returns cluster state at that point clusterStateProcessedcall is invoked, nextStepKey isA, butAisn't an async action so it's a no-op.
Can you describe the scenario you're worried about in more detail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you say is correct right now, I am worried that if the logic in clusterStateProcessed changes in the future we could inadvertently introduce a bug that will be tricky to track down because the nextStepKey is not actually set to the nextStep at this point. I think we should make it so nextStepKey is correct so we avoid this in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I've re-organized this so it is hopefully clearer
| ClusterStateWaitStep.Result result = ((ClusterStateWaitStep) currentStep).isConditionMet(index, currentState); | ||
| if (result.isComplete()) { | ||
| if (currentStep.getNextStepKey() == null) { | ||
| return currentState; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same answer as above :)
| logger.warn("current step [" + getCurrentStepKey(lifecycleState) + "] for index [" + indexMetaData.getIndex().getName() | ||
| + "] with policy [" + policy + "] is not recognized"); | ||
| logger.warn("current step [{}] for index [{}] with policy [{}] is not recognized", | ||
| getCurrentStepKey(lifecycleState), index, policy); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not a change for this PR but this should never happen now right as we aren't caching the steps int eh step registry? So we should probably throw an exception if it does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We get here if the index specifies a lifecycle.name that doesn't exist - I'll be opening a separate PR to handle that case probably today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can revisit this when we remove the policy steps registry
...k/plugin/ilm/src/main/java/org/elasticsearch/xpack/indexlifecycle/IndexLifecycleService.java
Show resolved
Hide resolved
| } else if (currentStep instanceof ClusterStateActionStep || currentStep instanceof ClusterStateWaitStep) { | ||
| logger.debug("[{}] running policy with current-step [{}]", indexMetaData.getIndex().getName(), currentStep.getKey()); | ||
| clusterService.submitStateUpdateTask("ILM", | ||
| new ExecuteStepsUpdateTask(policy, indexMetaData.getIndex(), currentStep, stepRegistry, this, nowSupplier)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we said we don't need ExecuteStepsUpdateTask, but rather run ClusterStateWaitStep on the incoming cluster state and if it matches move to the next step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless of whether the ClusterStateWaitStep is marked as "complete" or not, we have to issue a cluster state update, so to me we might as well keep it in the midst of the cluster state update?
|
@elasticmachine run the packaging tests please |
…te-step-execution
colings86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment but assuming that is addressed this LGTM
| String policyName = LifecycleSettings.LIFECYCLE_NAME_SETTING.get(idxMeta.getSettings()); | ||
| if (Strings.isNullOrEmpty(policyName) == false) { | ||
| StepKey stepKey = IndexLifecycleRunner.getCurrentStepKey(LifecycleExecutionState.fromIndexMetadata(idxMeta)); | ||
| if (OperationMode.STOPPING == currentMode && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have lost the bit that sets the currentMode to STOPPED if no indices are in the ignore actions list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it below in triggerPolicies() but I think we need it here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, I'll update that to add this.
colings86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm
…te-step-execution
This commit changes the way that step execution flows. Rather than have any step run when the cluster state changes or the periodic scheduler fires, this now runs the different types of steps at different times. `AsyncWaitStep` is run at a periodic manner, ie, every 10 minutes by default `ClusterStateActionStep` and `ClusterStateWaitStep` are run every time the cluster state changes. `AsyncActionStep` is now run only after the cluster state has been transitioned into a new step. This prevents these non-idempotent steps from running at the same time. It addition to being run when transitioned into, this is also run when a node is newly elected master (only if set as the current step) so that master failover does not fail to run the step. This also changes the `RolloverStep` from an `AsyncActionStep` to an `AsyncWaitStep` so that it can run periodically. Relates to #29823
This commit changes the way that step execution flows. Rather than have any step
run when the cluster state changes or the periodic scheduler fires, this now
runs the different types of steps at different times.
AsyncWaitStepis run at a periodic manner, ie, every 10 minutes by defaultClusterStateActionStepandClusterStateWaitStepare run every time thecluster state changes.
AsyncActionStepis now run only after the cluster state has been transitionedinto a new step. This prevents these non-idempotent steps from running at the
same time. It addition to being run when transitioned into, this is also run
when a node is newly elected master (only if set as the current step) so that
master failover does not fail to run the step.
This also changes the
RolloverStepfrom anAsyncActionStepto anAsyncWaitStepso that it can run periodically.Relates to #29823