[ILM] Fix Move To Step API causing ILM to hang #34618

gwbrown · 2018-10-18T20:17:45Z

The Move To Step API now checks to see if the target step is an
AsyncActionStep, and if so, runs it.

AsyncActionSteps are otherwise only run when they are entered by
executing the previous step, rather than periodically or on cluster state
updates, so if an AsyncActionStep was entered via the Move To Step API, ILM
would never touch that index again.

Fixes #34294

The Move To Step API now checks whether the target step is an AsyncActionStep, and if so, runs it. Previously, AsyncActionSteps were only run when they were entered by executing the previous step, so if an AsyncActionStep was entered via the Move To Step API, ILM would never touch that index again.
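The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Step`, `AsyncActionStep`, `MoveToStepSketch`, and `onMovedToStep` are simplified, hypothetical stand-ins for the real ILM types, and the `log` list exists only to make the flow observable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for ILM's step types; not the real Elasticsearch classes.
interface Step {
    String name();
}

// A step whose action is started once and completes asynchronously.
interface AsyncActionStep extends Step {
    void performAction(String index, Runnable onComplete);
}

// Hypothetical runner illustrating the fix described in this PR.
class MoveToStepSketch {
    final List<String> log = new ArrayList<>();

    // Assumed to be called after the state update that moved `index` to `target` has been applied.
    void onMovedToStep(String index, Step target) {
        log.add("moved " + index + " to " + target.name());
        if (target instanceof AsyncActionStep) {
            // Without this call nothing would start the async action, because it is
            // otherwise triggered only when the previous step finishes executing.
            ((AsyncActionStep) target).performAction(index,
                () -> log.add("completed " + target.name() + " on " + index));
        }
        // Other step kinds are assumed to be picked up periodically or on cluster state updates.
    }
}
```

In this sketch, moving to a plain `Step` only records the move, while moving to an `AsyncActionStep` also starts its action immediately, which is the gap the PR closes.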

elasticmachine · 2018-10-18T20:17:48Z

Pinging @elastic/es-core-infra

talevy · 2018-10-19T03:44:18Z

...m/src/main/java/org/elasticsearch/xpack/indexlifecycle/action/TransportMoveToStepAction.java

+                @Override
+                public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
+                    IndexMetaData newIndexMetaData = newState.metaData().index(indexMetaData.getIndex());
+                    if (newIndexMetaData == null) {


Can this occur due to batching of updates?

After talking with @DaveCTurner it seems like we won't have any batching here because batching occurs within the same instance of ClusterStateTaskExecutor and we don't implement batching ourselves here. However I think this check is still nice to have in case there are other factors at play.

Can this be assert newIndexMetaData != null?

I like David's suggestion here, as it suggests this should never happen more strongly than an if

An assert is less strong though, because the check will not be done in production code, only in tests.

If we can envisage any scenario where the newState passed to this method can differ from the state we returned in execute(), then I think this should stay as an if statement so we don't end up with an NPE thrown here because the index was deleted. If we are confident that this kind of scenario can never occur, then an assert is fine.

Right. I guess I don't see this hurting, so I won't block it, but it may mislead people new to the code into reasoning about a state in which this is possible.

We could add something like assert false : "there should be no opportunity for the index to be deleted" inside the if - that way we can catch it in testing while still handling it in production if there's a case we missed. Does that sound reasonable, or is it too messy?

I think we should leave this with the if statement so we are protected against NPEs. If we also want to add an assert to catch things in tests then that's fine, but I think the protection against an NPE in production should remain.
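The combination discussed in this thread, an `if` guard with `assert false` inside it, might look like the sketch below. It is a simplified model only: `NullGuardSketch`, `onStateProcessed`, and the `Map` standing in for cluster state metadata are hypothetical and are not the code in TransportMoveToStepAction.

```java
// Hypothetical, simplified model of the callback discussed above; not the real Elasticsearch code.
import java.util.Map;

class NullGuardSketch {
    // Returns a short description of what happened, so both branches are observable.
    static String onStateProcessed(Map<String, String> newIndices, String indexName) {
        String newIndexMetaData = newIndices.get(indexName);
        if (newIndexMetaData == null) {
            // Trips only when assertions are enabled (java -ea), as is typical in tests.
            assert false : "there should be no opportunity for the index to be deleted";
            // With assertions disabled, return early instead of throwing an NPE below.
            return "skipped";
        }
        // Safe to dereference the metadata here.
        return "ran step for " + newIndexMetaData.trim();
    }
}
```

With assertions enabled a missing index fails loudly, and with them disabled the method returns early, which is the trade-off the reviewers settle on.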

colings86

Left a reply on your comment but this LGTM

colings86 · 2018-10-19T07:48:27Z

...m/src/main/java/org/elasticsearch/xpack/indexlifecycle/action/TransportMoveToStepAction.java

+                @Override
+                public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
+                    IndexMetaData newIndexMetaData = newState.metaData().index(indexMetaData.getIndex());
+                    if (newIndexMetaData == null) {


After talking with @DaveCTurner it seems like we won't have any batching here because batching occurs within the same instance of ClusterStateTaskExecutor and we don't implement batching ourselves here. However I think this check is still nice to have in case there are other factors at play.

gwbrown · 2018-10-24T22:18:13Z

@elasticmachine retest this please

gwbrown added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Oct 18, 2018

gwbrown requested review from colings86, dakrone and talevy October 18, 2018 20:17

talevy reviewed Oct 19, 2018

colings86 approved these changes Oct 19, 2018

gwbrown added 2 commits October 24, 2018 15:23

Review comments

b6d96b0

Merge branch 'index-lifecycle' into ilm/fix-move-to-step

d82ab3d

Merge branch 'index-lifecycle' into ilm/fix-move-to-step

21e93a0

gwbrown merged commit f6ac0e4 into elastic:index-lifecycle Oct 29, 2018

gwbrown added the backport pending label Oct 29, 2018

gwbrown removed the backport pending label Oct 29, 2018

gwbrown mentioned this pull request Oct 29, 2018

[ILM] Move to step API does not work if moving to a new phase #34294

Closed

gwbrown mentioned this pull request Nov 8, 2018

Retrying an ILM action that is an AsyncActionStep does not work due to execution model changes #35397

Closed

gwbrown deleted the ilm/fix-move-to-step branch December 7, 2018 04:56

5 participants