Closed
Labels
:Data Management/ILM+SLM, >bug, Team:Data Management
Description
In #78547 we introduced batching for the ILM master tasks that occur on the happy path. However, if a high-shard-count cluster encounters problems while doing ILM-related things (perhaps some nodes are temporarily unavailable for taking a snapshot), then we process the resulting ilm-retry-failed-step and ilm-move-to-error-step tasks one by one, which can significantly delay the cluster's recovery from its problems.
We should batch these things together too.
It looks like we also enqueue a duplicate ilm-retry-failed-step task on each poll interval, although we do appear to treat the duplicates as no-ops at execution time.
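To illustrate the difference, here is a rough, language-agnostic sketch in Python. It is not Elasticsearch code: `IlmTask`, `apply_task`, `execute_one_by_one`, `execute_batched` and the step names `RUNNING`/`ERROR`/`RETRYING` are all hypothetical, and it assumes the main cost is one state publication per task that changes something.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IlmTask:
    """Hypothetical stand-in for a queued master task; not an Elasticsearch class."""
    kind: str   # "ilm-retry-failed-step" or "ilm-move-to-error-step"
    index: str  # name of the index the task targets


def apply_task(state: dict, task: IlmTask) -> dict:
    """Return the state after one task; return the same object if it is a no-op."""
    current = state.get(task.index)
    if task.kind == "ilm-move-to-error-step":
        target = "ERROR"
    elif task.kind == "ilm-retry-failed-step":
        # Assumption: a retry only applies to an index currently in the error step.
        if current != "ERROR":
            return state
        target = "RETRYING"
    else:
        raise ValueError(f"unknown task kind: {task.kind}")
    if current == target:
        return state
    new_state = dict(state)  # copy so the input state is never mutated
    new_state[task.index] = target
    return new_state


def execute_one_by_one(state: dict, tasks: list) -> tuple:
    """Unbatched: every task that changes the state costs one publication."""
    publications = 0
    for task in tasks:
        new_state = apply_task(state, task)
        if new_state != state:
            publications += 1
            state = new_state
    return state, publications


def execute_batched(state: dict, tasks: list) -> tuple:
    """Batched: drop duplicate tasks, apply the rest, publish at most once."""
    unique = list(dict.fromkeys(tasks))  # order-preserving de-duplication
    new_state = state
    for task in unique:
        new_state = apply_task(new_state, task)
    publications = 0 if new_state == state else 1
    return new_state, publications
```

With two indices moving to the error step and a duplicated retry for one of them, both functions reach the same final state, but under these assumptions the batched version publishes once instead of once per effective task.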
Relates #77466