Wait on shard failures

Currently, when executing an action (e.g., bulk, delete, or indexing operations) on all shards, if an exception occurs while executing the action on a replica shard, we send a shard failure message to the master. However, we do not wait for the master to acknowledge this message, and we do not handle failures in sending it. This is problematic because we still acknowledge the action to the client, which can result in lost writes. For example, when a primary is isolated from the master and its replicas, the following sequence of events can occur:
1. we write to the local primary
2. we fail to write to the replicas
3. we fail in notifying the master to fail the replicas
4. the primary acknowledges the write to the client
5. the master notices the primary is gone and promotes one of the replicas to be primary

In this case, the promoted replica does not have the write that was acknowledged to the client, which amounts to data loss.

If we instead waited for the master to acknowledge the shard failures, we would never have acknowledged the write to the client in this case.
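
To make the difference concrete, here is a minimal sketch of the isolated-primary scenario as a toy simulation. It is not Elasticsearch code: `Primary`, `MasterUnreachable`, `write_replica`, `notify_master`, and the `wait_for_master` flag are all hypothetical names made up for illustration. It only shows that ignoring a failed shard failure notification leads to an acknowledged write, while waiting on the master does not.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class MasterUnreachable(Exception):
    """Hypothetical error: the shard failure message could not reach the master."""


@dataclass
class Primary:
    # Hypothetical callables standing in for the real transport layer.
    write_replica: Callable[[str, str], bool]  # (replica, doc) -> True on success
    notify_master: Callable[[str], None]       # raises MasterUnreachable on failure
    replicas: List[str] = field(default_factory=list)
    local_docs: List[str] = field(default_factory=list)

    def index(self, doc: str, wait_for_master: bool) -> bool:
        """Return True if the write is acknowledged to the client."""
        self.local_docs.append(doc)  # 1. write to the local primary
        for replica in self.replicas:
            if self.write_replica(replica, doc):
                continue
            # 2. the replica write failed, so ask the master to fail that replica
            try:
                self.notify_master(replica)
            except MasterUnreachable:
                # 3. notifying the master failed as well
                if wait_for_master:
                    return False  # proposed behavior: do not acknowledge
                # current behavior: the failure is silently ignored
        return True  # 4. acknowledge the write to the client


def failing_replica(replica: str, doc: str) -> bool:
    return False  # the primary is isolated from its replicas


def unreachable_master(replica: str) -> None:
    raise MasterUnreachable(replica)  # the primary is isolated from the master


isolated = Primary(failing_replica, unreachable_master, replicas=["r1"])
current_ack = isolated.index("doc-1", wait_for_master=False)
proposed_ack = isolated.index("doc-2", wait_for_master=True)
print(current_ack, proposed_ack)
```

With the flag off, the client sees an acknowledgement for a write that no replica holds. With the flag on, the write is not acknowledged, so promoting a replica later does not lose an acknowledged write.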
- [x] Create listener mechanism for executing callbacks when exceptions occur sending a shard failure message to the master #14295
- [x] Add unit tests that show we wait until failure or success (do not have to handle the failures yet) #14707
- [x] Add general support for cluster state batch updates #14899 
- [x] Apply cluster state batch updates to shard failures #15016
- [x] Handle when the node we thought was the master is no longer the master (e.g., master might have stepped down) -> find the actual master (e.g., wait for a new master to be elected) and retry the failed shard notice #15748
- [x] Fail shard failure requests from illegal sources #16275 
- [x] Master tells us we are no longer the primary -> fail the local shard, retry request on new primary #16415 
- [x] Handle failed shard has already been removed from the routing table -> okay #16089 
- [x] Handle master side of shard failures (do not respond to the node until the new cluster state is published, otherwise report failure or allow the node to timeout) #15468


Wait on shard failures #14252
