Description
There are still scenarios in which RolloverAction can incorrectly cause the index to be moved to the Error Step, particularly at low polling intervals, such as those we use in our tests. During a rollover, there is a gap between each of these steps:
- The creation of the new index.
- The alias being swapped to the new index.
- The RolloverInfo being added to the original index.
#35168 prevented errors in the case where two RolloverSteps fired simultaneously for the same index, and one completed step 1 above before the other. However, an error can still occur if two RolloverSteps (R1 and R2) fire simultaneously in this case:
- R1 completes step 1.
- R2 issues a rollover request, fails with ResourceAlreadyExistsException and moves to the next step.
- R2 finishes, and moves on to running UpdateRolloverLifecycleDateStep. R2 fails because R1 has not yet completed step 3 (i.e. has not yet attached the RolloverInfo) and the index moves to the error step.
- R1 completes steps 2 and 3.
This is the root cause of #35244.
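The interleaving above can be replayed deterministically with a small in-memory model. This is only an illustrative sketch: `RolloverRaceSketch`, `State`, and the helper methods are hypothetical names and do not correspond to the actual ILM classes or the Elasticsearch API. It assumes that a rollover request which finds the new index already present simply lets the caller proceed to the next step, as described for R2.

```java
// Hypothetical model of the race; none of these names are real Elasticsearch classes.
final class RolloverRaceSketch {

    // Minimal stand-in for the parts of cluster state that matter to one rollover.
    static final class State {
        boolean newIndexExists;        // step 1: new index created
        boolean aliasSwapped;          // step 2: alias points to the new index
        boolean rolloverInfoAttached;  // step 3: RolloverInfo added to the original index
        boolean inErrorStep;           // the policy moved the index to the error step
    }

    // Models the index-creation part of a rollover request.
    // Returns false where the real request would fail with ResourceAlreadyExistsException.
    static boolean tryCreateNewIndex(State s) {
        if (s.newIndexExists) {
            return false;
        }
        s.newIndexExists = true;
        return true;
    }

    // Models UpdateRolloverLifecycleDateStep, which needs the RolloverInfo to be present.
    static void updateRolloverLifecycleDate(State s) {
        if (!s.rolloverInfoAttached) {
            s.inErrorStep = true;
        }
    }

    // Replays the problematic ordering of R1 and R2 on a single thread.
    static State replayInterleaving() {
        State s = new State();
        tryCreateNewIndex(s);                      // R1 completes step 1
        boolean r2Created = tryCreateNewIndex(s);  // R2's request fails: index already exists
        if (!r2Created) {
            updateRolloverLifecycleDate(s);        // R2 moves on; RolloverInfo is still missing
        }
        s.aliasSwapped = true;                     // R1 completes step 2
        s.rolloverInfoAttached = true;             // R1 completes step 3, too late for R2
        return s;
    }
}
```

After the replay, the rollover itself has fully succeeded, yet the index is still flagged as being in the error step, which is the symptom described above.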
Proposed Solution
@colings86, @talevy and I discussed this briefly. The solution we came up with is to add a step (say, VerifyRolloverStep), which checks that the alias no longer points to the original index and that the RolloverInfo has been attached to the original index before moving on to UpdateRolloverLifecycleDateStep.
However, there is no way to distinguish "a Rollover request is in flight and has not yet completed, so we should wait" from "someone manually created a subsequent index, and we should move to the error step". If someone manually creates a subsequent index, the policy would therefore wait at VerifyRolloverStep forever. To resolve this, VerifyRolloverStep should implement a timeout, based on step_time, that waits significantly longer than we believe a rollover should take to complete (10 minutes?). That will allow us to notify the user that something has, in fact, gone wrong.
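A minimal sketch of the decision VerifyRolloverStep would make is shown below. It is not the real step implementation: `VerifyRolloverSketch`, `Outcome`, `evaluate`, and its parameters are hypothetical names, the two booleans are assumed to be read from cluster state by the caller, and the 10-minute value is just the tentative figure floated above.

```java
// Hypothetical sketch of the proposed check; not the actual ILM step API.
final class VerifyRolloverSketch {

    enum Outcome { COMPLETE, WAIT, ERROR }

    // Tentative timeout from the proposal (10 minutes), expressed in milliseconds.
    static final long TEN_MINUTES_MILLIS = 10L * 60L * 1000L;

    // aliasPointsToOriginal: the alias still points to the original index
    // rolloverInfoAttached:  the RolloverInfo is present on the original index
    // stepTimeMillis:        when the policy entered this step (the step_time)
    // nowMillis:             the current time
    // timeoutMillis:         how long to wait before giving up
    static Outcome evaluate(boolean aliasPointsToOriginal,
                            boolean rolloverInfoAttached,
                            long stepTimeMillis,
                            long nowMillis,
                            long timeoutMillis) {
        // Both conditions hold: the rollover has fully completed, so move on.
        if (!aliasPointsToOriginal && rolloverInfoAttached) {
            return Outcome.COMPLETE;
        }
        // Otherwise a rollover may still be in flight. Wait until the timeout
        // elapses, then surface the problem through the error step.
        long waitedMillis = nowMillis - stepTimeMillis;
        return waitedMillis >= timeoutMillis ? Outcome.ERROR : Outcome.WAIT;
    }
}
```

For example, `evaluate(true, false, 0L, 60_000L, TEN_MINUTES_MILLIS)` yields `WAIT` one minute into the step, while the same state at eleven minutes yields `ERROR`. Because the timeout is measured from step_time rather than from a per-poll counter, the result does not depend on the polling interval.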