[Indexing] A network partition can cause in flight documents to be lost #7572

@bleskes

This ticket captures an issue discovered as part of the work done in #7493, which contains a failing reproduction test marked with @awaitFix.

If a network partition separates a node from the master, there is a window of time before the node detects it. The length of this window depends on the type of partition. It is extremely small if a socket is broken. More adversarial partitions, for example silently dropping requests without breaking the socket, can take longer to detect (up to 3x30s, i.e. 90s, using current defaults).
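To make the worst-case figure concrete, here is a minimal sketch of the arithmetic. It assumes "3x30s" means three attempts that each wait out a 30s timeout before the node gives up on the master. The function and parameter names are hypothetical and do not come from the Elasticsearch code base.

```python
def worst_case_detection_window(attempts: int, timeout_s: float) -> float:
    """Upper bound on the time before a silent partition is noticed.

    Assumption: every attempt must time out in full before the next one
    starts, and the node only declares the master gone after the last one.
    """
    if attempts < 1 or timeout_s < 0:
        raise ValueError("attempts must be >= 1 and timeout_s must be >= 0")
    return attempts * timeout_s


# Values taken from the "3x30s" figure above.
window = worst_case_detection_window(attempts=3, timeout_s=30.0)
print(window)  # 90.0
```

During this window the isolated node still believes it is part of the cluster, which is what makes the scenario below possible.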

If the node hosts a primary shard at the moment of the partition and ends up isolated from the cluster (a situation that could previously have resulted in split brain), some documents being indexed into that primary may be lost. This happens when they fail to reach one of the allocated replicas (due to the partition) and that replica is later promoted to primary by the master.
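The sequence of events can be illustrated with a toy model. This is only a sketch of the failure mode described above, not Elasticsearch's replication code: `Shard`, `index_on_primary`, and the `reachable` set are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """A toy shard copy: just a name and a set of document ids."""
    name: str
    docs: set = field(default_factory=set)


def index_on_primary(primary, replicas, doc_id, reachable):
    """Index on the primary, then try to replicate to each replica.

    `reachable` is the set of replica names the primary can still reach.
    Replication to any other replica fails because of the partition.
    Returns the names of the replicas that missed the document.
    """
    primary.docs.add(doc_id)
    missed = []
    for replica in replicas:
        if replica.name in reachable:
            replica.docs.add(doc_id)
        else:
            missed.append(replica.name)
    return missed


old_primary = Shard("node_a")
replica = Shard("node_b")

# Before the partition: the document reaches both copies.
index_on_primary(old_primary, [replica], "doc-1", reachable={"node_b"})

# The partition has started, but node_a has not detected it yet.
missed = index_on_primary(old_primary, [replica], "doc-2", reachable=set())

# On the other side of the partition, the master promotes the replica.
new_primary = replica

# Documents that exist only on the isolated old primary are lost.
lost = old_primary.docs - new_primary.docs
print(sorted(lost))  # ['doc-2']
```

In this model `doc-2` exists only on the isolated node, so it disappears from the cluster's view once `node_b` becomes the primary.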
