Wait on shard failures #14252

@jasontedor

Currently, when executing an action (e.g., bulk, delete, or indexing operations) on all shards, if an exception occurs while executing the action on a replica shard, we send a shard failure message to the master. However, we neither wait for the master to acknowledge this message nor handle failures in sending it. This is problematic because we still acknowledge the action to the client, which can result in lost writes. For example, when a primary is isolated from the master and its replicas, the following sequence of events can occur:

  1. we write to the local primary
  2. we fail to write to the replicas
  3. we fail in notifying the master to fail the replicas
  4. the primary acknowledges the write to the client
  5. the master notices the primary is gone and promotes one of the replicas to be primary

In this case, the promoted replica does not have the write that was acknowledged to the client, and this amounts to data loss.

Instead, if we waited for the master to acknowledge the shard failures, we would never have acknowledged the write to the client in this case.
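To make the difference concrete, here is a minimal sketch of the two behaviours in Python. It is a toy model, not Elasticsearch code: `replicate_write`, `WriteResult`, `MasterUnreachable`, and the `wait_for_master_ack` flag are all hypothetical names introduced for illustration, and the write to the local primary is assumed to have already succeeded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class MasterUnreachable(Exception):
    """Hypothetical error: the shard-failure message was not delivered or not acknowledged."""


@dataclass
class WriteResult:
    acknowledged: bool                      # whether the client gets an acknowledgement
    failed_replicas: List[str] = field(default_factory=list)
    reason: str = ""


def replicate_write(
    replica_writes: Dict[str, Callable[[], None]],
    notify_master_shard_failed: Callable[[str], None],
    wait_for_master_ack: bool,
) -> WriteResult:
    """Toy model of steps 2-4 above (the primary write is assumed done)."""
    # Step 2: try the write on every replica and collect the ones that fail.
    failed = []
    for name, write in replica_writes.items():
        try:
            write()
        except Exception:
            failed.append(name)

    # Step 3: ask the master to fail each replica that missed the write.
    for name in failed:
        try:
            notify_master_shard_failed(name)
        except MasterUnreachable:
            if wait_for_master_ack:
                # Proposed behaviour: without the master's acknowledgement,
                # do not acknowledge the write to the client.
                return WriteResult(False, failed, f"master did not acknowledge failure of {name}")
            # Behaviour described in this issue: the notification failure is ignored.

    # Step 4: acknowledge the write to the client.
    return WriteResult(True, failed)


def failing_write() -> None:
    raise ConnectionError("replica unreachable")


def unreachable_master(replica: str) -> None:
    raise MasterUnreachable(replica)


# An isolated primary: the replica write fails and the master cannot be reached.
current = replicate_write({"replica-0": failing_write}, unreachable_master, wait_for_master_ack=False)
proposed = replicate_write({"replica-0": failing_write}, unreachable_master, wait_for_master_ack=True)
print(current.acknowledged)   # True: the client is told the write succeeded
print(proposed.acknowledged)  # False: the write is not acknowledged
```

In the isolated-primary scenario, the fire-and-forget path acknowledges a write that no replica holds, while the waiting path refuses to acknowledge it. How a real implementation would surface or retry the unacknowledged write is left open here.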
