@gwbrown
Contributor

@gwbrown gwbrown commented Nov 1, 2018

If the Rollover step would fail due to the next index in sequence
already existing, just skip to the next step instead of going to the
Error step.

This prevents spurious ResourceAlreadyExistsExceptions created by
simultaneous RolloverStep executions from causing ILM to error out
unnecessarily.
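
A minimal sketch of the behavior described above, not the actual change in this PR. `RolloverSketch`, `AlreadyExists`, `Outcome`, `createIndex` and `attemptRollover` are hypothetical names, and a plain `Set` stands in for the cluster's indices. The point is only that an "already exists" failure is treated as if the rollover had succeeded, while any other failure still goes to the error step.

```java
import java.util.HashSet;
import java.util.Set;

final class RolloverSketch {
    /** Hypothetical stand-in for ResourceAlreadyExistsException. */
    static final class AlreadyExists extends RuntimeException {
        AlreadyExists(String index) {
            super("index [" + index + "] already exists");
        }
    }

    /** Where the lifecycle moves after the rollover attempt. */
    enum Outcome { NEXT_STEP, ERROR_STEP }

    /** Hypothetical create call: rejects empty names and duplicates. */
    static void createIndex(Set<String> existing, String name) {
        if (name.isEmpty()) {
            throw new IllegalArgumentException("index name must not be empty");
        }
        if (!existing.add(name)) {
            throw new AlreadyExists(name);
        }
    }

    /** Attempts the rollover; a duplicate-index failure is not an error. */
    static Outcome attemptRollover(Set<String> existing, String nextIndex) {
        try {
            createIndex(existing, nextIndex);
            return Outcome.NEXT_STEP;
        } catch (AlreadyExists e) {
            // Assumed cause: a simultaneous execution already created the index,
            // so move on instead of entering the error step.
            return Outcome.NEXT_STEP;
        } catch (RuntimeException e) {
            // Any other failure is still a real error.
            return Outcome.ERROR_STEP;
        }
    }
}
```

With this shape, two back-to-back calls for `test-000002` both report `NEXT_STEP`, which is the outcome the fix is aiming for when executions race.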

Resolves #34465 - I tested this fix manually for a few hours against a setup that otherwise
reliably reproduced that issue, and did not encounter any
failures due to an unnecessary ResourceAlreadyExistsException.

However, I'm still somewhat concerned that the error message a user encounters
if they really do have a problem with the index already existing isn't clear - see the test case
I added (testRolloverAlreadyExists) to see what I mean. We could change that
message to be clearer in this case, but that may make the error less clear if
the RolloverInfo is null for some other reason. Any thoughts?

@gwbrown gwbrown added >bug blocker :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Nov 1, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

logger.info(secondIndex + ": " + getStepKeyForIndex(secondIndex));
assertThat(getStepKeyForIndex(originalIndex), equalTo(new StepKey("hot", RolloverAction.NAME, ErrorStep.NAME)));
assertThat(getFailedStepForIndex(originalIndex), equalTo("update-rollover-lifecycle-date"));
assertThat(getReasonForIndex(originalIndex), equalTo("index [" + originalIndex + "] has not rolled over yet"));
Contributor

I agree that this error message is confusing. Just to check that I understand correctly: is the problem with the second index actually that it doesn't have the rollover alias setting set and doesn't have any rollover info?

Contributor Author

@gwbrown gwbrown Nov 1, 2018

Here's an example: If you have test-000001 and manually create test-000002 before rollover fires (and don't do anything else), then test-000001 will end up in the error state with this message because test-000001 has not been rolled over, and therefore doesn't have any attached RolloverInfo. test-000001 is still the target of the alias and still has the rollover alias setting.

If test-000002 had been manually created and the alias had been manually switched to point to test-000002, attempt_rollover would fail due to #35065.

Does that help clarify?
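
To make the scenario concrete, here is a rough sketch of the kind of check that produces this message. It is an illustration only, not the real step implementation: `RolloverDateSketch`, `rolloverTime` and the map of rollover times are hypothetical stand-ins for the index's RolloverInfo. Only the message text is taken from the test assertion quoted above.

```java
import java.util.HashMap;
import java.util.Map;

final class RolloverDateSketch {
    /**
     * Returns the recorded rollover time for the index, or fails with the
     * message from the test if no rollover info is attached to it.
     */
    static long rolloverTime(Map<String, Long> rolloverInfoByIndex, String index) {
        Long time = rolloverInfoByIndex.get(index);
        if (time == null) {
            // test-000001 was never rolled over, so nothing is recorded for it,
            // even though test-000002 already exists.
            throw new IllegalStateException("index [" + index + "] has not rolled over yet");
        }
        return time;
    }
}
```

In the manual-creation case, the lookup for `test-000001` finds nothing and the step fails, regardless of why `test-000002` exists.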

Contributor

I think we should just change the message in this step to say something like:

No rollover info found for index [INDEX_NAME]. Either the index has not yet rolled over or a subsequent index was created outside of Index Lifecycle Management. 

@gwbrown gwbrown changed the base branch from index-lifecycle to master November 2, 2018 13:58
@gwbrown
Contributor Author

gwbrown commented Nov 2, 2018

CI failure seems unrelated and does not reproduce locally.

@elasticmachine test this please

@gwbrown gwbrown merged commit 0fbb8a1 into elastic:master Nov 5, 2018
gwbrown added a commit that referenced this pull request Nov 5, 2018
@gwbrown
Contributor Author

gwbrown commented Nov 5, 2018

Whoops, just realized I got mixed up and this didn't actually get any official approvals from reviewers - my bad. Let me know if there were more changes you wanted me to make.

Development

Successfully merging this pull request may close these issues.

[ILM] Rollover action errors after restart

4 participants