Skip to content

Conversation

@original-brownbear
Copy link
Contributor

This fixes missing to marking shard snapshots as failures when
multiple data-nodes are lost during the snapshot process or
shard snapshot failures have occured before a node left the cluster.

The problem was that we were simply not adding any shard entries for completed
shards on node-left events. This has no effect for a successful shard, but
for a failed shard would lead to that shard not being marked as failed during
snapshot finalization. Fixed by corectly keeping track of all previous completed
shard states as well in this case.
Also, added an assertion that without this fix would trip on almost every run of the
resiliency tests and adjusted the serialization of SnapshotsInProgress.Entry so
we have a proper assertion message.

Closes #47550

backport of #47552

This fixes missing to marking shard snapshots as failures when
multiple data-nodes are lost during the snapshot process or
shard snapshot failures have occured before a node left the cluster.

The problem was that we were simply not adding any shard entries for completed
shards on node-left events. This has no effect for a successful shard, but
for a failed shard would lead to that shard not being marked as failed during
snapshot finalization. Fixed by corectly keeping track of all previous completed
shard states as well in this case.
Also, added an assertion that without this fix would trip on almost every run of the
resiliency tests and adjusted the serialization of SnapshotsInProgress.Entry so
we have a proper assertion message.

Closes elastic#47550
@original-brownbear original-brownbear added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs backport labels Oct 5, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear
Copy link
Contributor Author

Jenkins run elasticsearch-ci/default-distro

@original-brownbear original-brownbear merged commit 6c9687b into elastic:7.4 Oct 5, 2019
@original-brownbear original-brownbear deleted the 47552-7.4 branch October 5, 2019 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants