Batch up failure-related ILM master tasks #81880

@DaveCTurner

In #78547 we introduced batching for the ILM master tasks that occur on the happy path. However, if a high-shard-count cluster encounters problems while doing ILM-related things (perhaps some nodes are temporarily unavailable for taking a snapshot), then we process the resulting ilm-retry-failed-step and ilm-move-to-error-step tasks one-by-one, which can significantly delay the cluster's recovery from its problems.

We should batch these things together too.

It looks like we also enqueue duplicate ilm-retry-failed-step tasks on each poll interval, although we do appear to treat the duplicates as no-ops at execution time.
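
To make the difference concrete, here is a rough toy model of the two behaviours, not the actual Elasticsearch code. Everything in it is an assumption made for illustration: the `FailureTask` type, the dict-of-step-names "cluster state", the `"retrying"` placeholder step, and the idea of counting a "publication" each time a new state object is produced. It only aims to show why folding many failure tasks into one state update saves work, and how identical pending tasks could be skipped when they are enqueued.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureTask:
    """Hypothetical stand-in for a failure-related ILM master task."""
    kind: str   # "ilm-retry-failed-step" or "ilm-move-to-error-step"
    index: str  # name of the index whose ILM step failed


def apply_task(state, task):
    """Apply one task to a toy state (index name -> current step name).

    Returns the very same object when the task changes nothing, which is
    how this model represents a no-op such as a duplicate retry.
    """
    current = state.get(task.index)
    if task.kind == "ilm-move-to-error-step" and current != "ERROR":
        return {**state, task.index: "ERROR"}
    if task.kind == "ilm-retry-failed-step" and current == "ERROR":
        return {**state, task.index: "retrying"}
    return state


def run_one_by_one(state, tasks):
    """One state update (and one publication) per task that changes state."""
    publications = 0
    for task in tasks:
        new_state = apply_task(state, task)
        if new_state is not state:
            publications += 1
            state = new_state
    return state, publications


def run_batched(state, tasks):
    """Fold every task into a single new state, published at most once."""
    new_state = state
    for task in tasks:
        new_state = apply_task(new_state, task)
    publications = 0 if new_state is state else 1
    return new_state, publications


def enqueue(pending, task):
    """Add a task to the pending queue unless an equal task is already there.

    `pending` is a dict used as an insertion-ordered set.
    """
    if task in pending:
        return False
    pending[task] = None
    return True
```

In this model, N failing indices cost N publications when processed one-by-one but a single publication when batched, and both paths end in the same state. The `enqueue` check relies on `FailureTask` being a frozen dataclass, so two tasks with the same kind and index compare equal and hash identically.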

Relates #77466
