@talevy talevy commented Jul 19, 2018

A step may be missing from the registry simply because the registry is stale relative to the
cluster state, so we do not want to throw an IllegalStateException in that case.

Staleness is self-healing: it resolves once follow-up cluster state updates are applied. This is
therefore a recoverable issue that should log a warning instead of throwing an exception.

Closes #32181.

@talevy talevy added the >bug and :Data Management/ILM+SLM (Index and Snapshot lifecycle management) labels Jul 19, 2018
@elasticmachine

Pinging @elastic/es-core-infra

"current step for index [" + indexMetaData.getIndex().getName() + "] with policy [" + policy + "] is not recognized");
// This may happen in the case that there is invalid ilm-step index settings or the stepRegistry is out of
// sync with the current cluster state
logger.warn("current step [" + getCurrentStepKey(indexSettings) + "] for index [" + indexMetaData.getIndex().getName()
Contributor Author

Although I love that we want to provide the step info here, I have mixed feelings about the assertions that exist inside of getCurrentStepKey. I think they should be changed to throw IllegalStateException instead. What do you think, @colings86?

Contributor

I think the assertions are fine because ILM is in full control of the phase, action and step settings (they are INTERNAL settings, so they can't be touched by a user, and they will soon be moved to a custom index metadata object, further locking down access to them). Therefore, I don't think it's necessary to check in production code that if one is set all three are set, but it's useful to have the check in testing to make sure we don't do something silly, so assertions feel right to me.
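
To make the trade-off concrete, here is a minimal sketch, assuming a hypothetical `StepKeySketch` helper that takes plain strings in place of the real index settings. It is not the actual `getCurrentStepKey` implementation; it only shows an `assert` guarding an invariant that the code itself controls, which is checked when tests run with `-ea` and skipped in production.

```java
final class StepKeySketch {
    /** Hypothetical stand-in for a step key: phase, action and step name. */
    static final class StepKey {
        final String phase;
        final String action;
        final String name;

        StepKey(String phase, String action, String name) {
            this.phase = phase;
            this.action = action;
            this.name = name;
        }
    }

    static boolean isEmpty(String s) {
        return s == null || s.isEmpty();
    }

    /** Returns null when no step is recorded at all, otherwise the recorded key. */
    static StepKey currentStepKey(String phase, String action, String step) {
        if (isEmpty(phase) && isEmpty(action) && isEmpty(step)) {
            return null;
        }
        // Internal invariant (assumed): the three values are always written together,
        // so a partial key indicates a bug in our own code, not bad user input.
        // These checks only run when assertions are enabled (java -ea).
        assert !isEmpty(phase) : "phase is missing";
        assert !isEmpty(action) : "action is missing";
        assert !isEmpty(step) : "step is missing";
        return new StepKey(phase, action, step);
    }
}
```

Switching the `assert` lines to `throw new IllegalStateException(...)` would enforce the same invariant in production as well, which is the alternative raised above.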

+ LifecycleSettings.LIFECYCLE_SKIP + "== true");
return;
}
Step currentStep = getCurrentStep(stepRegistry, policy, indexSettings);
Contributor Author

@talevy talevy Jul 19, 2018

I looked closer at this. I missed something here.

This method calls PolicyStepsRegistry#getStep, which chooses to throw IllegalStateExceptions as well. Further investigation is needed to see what the repercussions of changing that would be.

Tests also need to be added to walk these paths. The current tests do not catch this.

Contributor

I think we probably need to make PolicyStepsRegistry.getStep() return null if the step is missing (in much the same way that a map returns null for a missing key). Then we'll have to check for null:

  • Here in IndexLifecycleRunner.runPolicy(): if it's null, log a warning and return
  • In IndexLifecycleRunner.moveClusterStateToStep(), where I think we should do the same as above
  • In ExecuteStepsUpdateTask.execute(), where we should log the warning if it's null but return the cluster state as it was just before changing the current step, so we don't lose all the progress made on the previous cluster state steps

wdyt?
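
A rough sketch of the first bullet, assuming a hypothetical, heavily simplified registry keyed by strings and using `java.util.logging` rather than the project's logger. The names `StaleRegistrySketch`, `Registry` and this `runPolicy` signature are illustrative only and do not mirror the real classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

final class StaleRegistrySketch {
    private static final Logger logger = Logger.getLogger(StaleRegistrySketch.class.getName());

    /** Hypothetical registry: policy name -> (step key -> step). */
    static final class Registry {
        private final Map<String, Map<String, Runnable>> stepsByPolicy = new HashMap<>();

        void register(String policy, String stepKey, Runnable step) {
            stepsByPolicy.computeIfAbsent(policy, p -> new HashMap<>()).put(stepKey, step);
        }

        /** Returns null for an unknown policy or step, like Map.get for a missing key. */
        Runnable getStep(String policy, String stepKey) {
            Map<String, Runnable> steps = stepsByPolicy.get(policy);
            return steps == null ? null : steps.get(stepKey);
        }
    }

    /** Returns true if the step ran, false if it was skipped because the registry did not know it. */
    static boolean runPolicy(Registry registry, String policy, String index, String stepKey) {
        Runnable step = registry.getStep(policy, stepKey);
        if (step == null) {
            // The registry may simply be behind the cluster state; a later update
            // is expected to fix that, so warn and return instead of throwing.
            logger.warning("current step [" + stepKey + "] for index [" + index
                + "] with policy [" + policy + "] is not recognized");
            return false;
        }
        step.run();
        return true;
    }
}
```

The boolean return value exists only to make the sketch easy to check; the point is that a missing step becomes a logged warning and an early return.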

Contributor Author

That is what I was thinking as well. Just wanted to double check with you before I take that route.

@talevy talevy added the WIP label Jul 19, 2018
@talevy talevy requested a review from colings86 July 20, 2018 15:39
@talevy talevy removed the WIP label Jul 24, 2018
@talevy talevy requested review from colings86 and dakrone and removed request for colings86 July 26, 2018 01:04
Contributor

@colings86 colings86 left a comment

@talevy I left a few comments

throw new IllegalArgumentException(e.getMessage());
Step nextStep = stepRegistry.getStep(indexPolicySetting, nextStepKey);
if (nextStep == null) {
throw new IllegalArgumentException("step [" + nextStepKey + "] with policy [" + indexPolicySetting + "] does not exist");
Contributor

Could you explain why this still needs to throw an exception rather than just returning an unmodified cluster state? Also, I know we threw an IllegalArgumentException before, but it feels wrong to me since the user won't have supplied anything invalid.

Contributor Author

I am cool with that.

I still feel like IllegalArgumentException should be thrown to users of TransportMoveToStepAction. So I will push up a basic exception for that; the action will know that nothing "right" happened, since the clusterStates will be the same instance.

  if (nextStep == null) {
-     throw new IllegalArgumentException("step [" + nextStepKey + "] with policy [" + indexPolicySetting + "] does not exist");
+     // stepRegistry may not be up-to-date with latest policies/steps in cluster-state
+     return currentState;
Contributor

Sorry, I had not looked closely enough at what this method is used for. It seems that it's only used for the "move to step" and retry APIs, so it's actually probably OK for this method to throw an exception directly. I had thought before that we use this method for moving between steps normally.

I think we should throw directly here, as the alternative of returning the same cluster state and then using that as an indication that the step is not recognised is a bit trappy. To ensure this method is kept only for API calls and not for normal execution, maybe we can add a JavaDoc comment?
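
As an illustration of that suggestion, here is a minimal sketch, assuming a hypothetical `MoveToStepSketch.moveToStep` that models the cluster state as a plain map from index name to step key. It is not the real `moveClusterStateToStep`; it only shows throwing directly for an unknown step, with a JavaDoc note restricting the method to API-driven calls.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class MoveToStepSketch {
    /**
     * Moves an index to the requested step (hypothetical simplification).
     *
     * <p>Intended only for explicit API requests such as "move to step" and retry,
     * not for normal step-to-step execution. Because the caller asked for a specific
     * step, an unknown step is reported by throwing rather than by silently
     * returning the input state.
     *
     * @throws IllegalArgumentException if the requested step is not known
     */
    static Map<String, String> moveToStep(Set<String> knownSteps, Map<String, String> currentState,
                                          String index, String nextStepKey) {
        if (!knownSteps.contains(nextStepKey)) {
            throw new IllegalArgumentException("step [" + nextStepKey + "] does not exist");
        }
        // Build a new state rather than mutating the input, so callers never have to
        // compare instances to find out whether anything happened.
        Map<String, String> newState = new HashMap<>(currentState);
        newState.put(index, nextStepKey);
        return Collections.unmodifiableMap(newState);
    }
}
```

With this shape a caller cannot mistake "nothing happened" for success: either a new state comes back or an exception surfaces.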

Contributor Author

OK. I will revert the latest commit and add a comment.

@talevy talevy requested a review from colings86 July 30, 2018 18:44
Contributor

@colings86 colings86 left a comment

LGTM

@talevy talevy merged commit 6e9f338 into elastic:index-lifecycle Jul 31, 2018
@talevy talevy deleted the ilm-no-illegal-state branch July 31, 2018 14:06
Contributor Author

talevy commented Jul 31, 2018

thanks!

jasontedor pushed a commit that referenced this pull request Aug 17, 2018