A primary failing a replica should not count against the shard failures counter

To avoid endless assignment loops, we currently limit the number of times a shard will be allocated after a failure. We only count real failures toward this limit and ignore things like nodes dropping off the cluster. However, if the index is actively being indexed into while a node is disconnected, the primary will ask the master to fail the shard so that it can ack the indexing operation. If that shard failure request reaches the master before the master processes the node leaving the cluster, we increment the shard failure counter. If this happens repeatedly, the shard will no longer be assigned and tests time out.

This is the cause of the failure of https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+dockeralpine-periodic/167/console

Since we designed the shard failure counter to protect against broken allocations (missing synonyms files etc.), we shouldn't count failures coming from the primary. This does come with the downside that we are not protected against partial network disconnects that the master doesn't see. These may then cause a replica to be allocated again and again.
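
To make the proposal concrete, here is a minimal sketch of the counting rule, not the actual Elasticsearch code. Every name in it (`ShardFailureCounterSketch`, `FailureSource`, `onShardFailed`, `canAllocate`) is hypothetical, and the retry limit passed in `main` is only an illustrative value. It assumes the master can tell whether a failure came from the shard's own allocation or was requested by the primary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are real Elasticsearch APIs.
class ShardFailureCounterSketch {

    // Assumed classification of who reported the shard failure to the master.
    enum FailureSource {
        ALLOCATION,       // the shard itself failed to start, e.g. a missing synonyms file
        PRIMARY_REPORTED  // the primary asked the master to fail a replica
    }

    private final int maxRetries;
    private final Map<String, Integer> failedAllocations = new HashMap<>();

    ShardFailureCounterSketch(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Record a failure; only allocation failures increment the counter.
    void onShardFailed(String shardId, FailureSource source) {
        if (source == FailureSource.ALLOCATION) {
            failedAllocations.merge(shardId, 1, Integer::sum);
        }
    }

    int failureCount(String shardId) {
        return failedAllocations.getOrDefault(shardId, 0);
    }

    // A shard may be assigned again while it is below the retry limit.
    boolean canAllocate(String shardId) {
        return failureCount(shardId) < maxRetries;
    }

    public static void main(String[] args) {
        ShardFailureCounterSketch counter = new ShardFailureCounterSketch(5);

        // Repeated primary-reported failures leave the counter untouched.
        for (int i = 0; i < 10; i++) {
            counter.onShardFailed("replica-a", FailureSource.PRIMARY_REPORTED);
        }
        System.out.println(counter.canAllocate("replica-a")); // prints "true"

        // Repeated allocation failures eventually block assignment.
        for (int i = 0; i < 5; i++) {
            counter.onShardFailed("replica-b", FailureSource.ALLOCATION);
        }
        System.out.println(counter.canAllocate("replica-b")); // prints "false"
    }
}
```

The sketch also shows the downside mentioned above: because primary-reported failures are never counted, a replica that keeps failing due to a partial disconnect the master cannot see would be reassigned indefinitely.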


A primary failing a replica should not count against the shard failures counter #20834
