Description
To avoid endless assignment loops, we currently limit the number of times a shard will be allocated after failure. We only count real failures toward this limit and ignore things like nodes dropping off the cluster. However, if the index is actively being indexed while a node is disconnected, the primary will ask the master to fail the shard so it can ack the indexing operation. If that shard failure request reaches the master before the master processes the node leaving the cluster, we increment the shard failure counter. If this happens repeatedly, the shard will no longer be assigned and tests time out.
This is the cause of the failure of https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+dockeralpine-periodic/167/console
Since we designed the shard failure counter to protect against broken allocations (missing synonyms files etc.), we shouldn't count failures coming from the primary. This does come with the downside that we are not protected against partial network disconnects that the master doesn't see; these may then cause a replica to be allocated again and again.
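To make the race and the proposed change concrete, here is a minimal Python model of the counter. It is only a sketch and not the Elasticsearch implementation: `ShardState`, `FailureSource`, `simulate_disconnects`, and the `MAX_RETRIES` value are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto

MAX_RETRIES = 5  # assumed limit for this sketch, not taken from the issue


class FailureSource(Enum):
    ALLOCATION = auto()       # shard could not start on the node (e.g. missing synonyms file)
    PRIMARY_REQUEST = auto()  # primary asked the master to fail a replica it could not reach
    NODE_LEFT = auto()        # master processed the node leaving the cluster


@dataclass
class ShardState:
    failed_allocations: int = 0

    def record_failure(self, source: FailureSource, count_primary_requests: bool) -> None:
        """Increment the counter only for failures that should count."""
        if source is FailureSource.ALLOCATION:
            self.failed_allocations += 1
        elif source is FailureSource.PRIMARY_REQUEST and count_primary_requests:
            # Current behaviour described above: the primary's request is counted
            # when it reaches the master before the node-left event is processed.
            self.failed_allocations += 1
        # NODE_LEFT is never counted in either mode.

    def can_allocate(self) -> bool:
        return self.failed_allocations < MAX_RETRIES


def simulate_disconnects(n: int, count_primary_requests: bool) -> ShardState:
    """Model n disconnects where the primary's request always wins the race."""
    shard = ShardState()
    for _ in range(n):
        shard.record_failure(FailureSource.PRIMARY_REQUEST, count_primary_requests)
        shard.record_failure(FailureSource.NODE_LEFT, count_primary_requests)
    return shard
```

With `count_primary_requests=True`, repeated disconnects exhaust the retry budget and the shard stays unassigned. With `False` (the proposal), the counter is untouched by disconnects, which also means it no longer guards against partial disconnects that only the primary observes.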