Update IndexShardSnapshotStatus when an exception is encountered #32265
Conversation
…e snapshot call is made

When a snapshot is about to start for a shard that is in one of the following states, we currently don't move the snapshot status to FAILED; this change adds code to do so:
i) Not primary on the current node
ii) Relocating
iii) Recovering
iv) Index hasn't been loaded yet in the indices service

Also adds the ABORTED status to the IndexShard Status/Stats/Stage.
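To make the intended behavior concrete, here is a minimal, self-contained Java sketch of the check described above. The enum values and the `moveToFailed` helper are illustrative stand-ins, not the actual Elasticsearch classes:

```java
// Simplified model of the pre-snapshot check; names are hypothetical, not the real ES internals.
import java.util.EnumSet;
import java.util.Set;

public class ShardSnapshotGuard {

    // States (from the description above) in which a shard cannot be snapshotted.
    enum ShardCondition { STARTED_PRIMARY, NOT_PRIMARY, RELOCATING, RECOVERING, INDEX_NOT_LOADED }

    // Hypothetical status holder standing in for IndexShardSnapshotStatus.
    static final class ShardSnapshotStatus {
        String stage = "INIT";
        String failureReason;

        void moveToFailed(String reason) {   // assumed method name, for illustration only
            this.stage = "FAILURE";
            this.failureReason = reason;
        }
    }

    private static final Set<ShardCondition> UNSNAPSHOTTABLE =
        EnumSet.of(ShardCondition.NOT_PRIMARY, ShardCondition.RELOCATING,
                   ShardCondition.RECOVERING, ShardCondition.INDEX_NOT_LOADED);

    // Before starting the shard snapshot, move the status to FAILED instead of leaving it hanging.
    static boolean startOrFail(ShardCondition condition, ShardSnapshotStatus status) {
        if (UNSNAPSHOTTABLE.contains(condition)) {
            status.moveToFailed("shard cannot be snapshotted: " + condition);
            return false;       // caller reports the failure back to the master
        }
        return true;            // safe to proceed with the actual snapshot
    }

    public static void main(String[] args) {
        ShardSnapshotStatus status = new ShardSnapshotStatus();
        startOrFail(ShardCondition.RELOCATING, status);
        System.out.println(status.stage + " / " + status.failureReason);
    }
}
```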
Pinging @elastic/es-distributed
Thanks for reporting this. I've looked more closely into the series of events leading to this situation: When a node leaves the cluster, the master will fail all shards allocated to that node. SnapshotsService on the master (which is a ClusterStateApplier) will get the updated cluster state with the removed node and call processSnapshotsOnRemovedNodes, which in turn will submit a cluster state update task to move the snapshot from STARTED/ABORTED to FAILED.

The first thing we'll need to do here is to write an integration test that reproduces the issue. Regarding a fix, I would prefer to have a SnapshotsInProgress object that's fully in sync with the routing table, similar to what I have done here for the RestoreInProgress information, and then build a solution on top of that. I'll explore this further in the next days, just wanted to give you an update here.
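For illustration, here is a simplified, self-contained model of the transition described above (in-progress shard snapshots assigned to a departed node move to FAILED). The names are hypothetical stand-ins, not the real SnapshotsService API:

```java
// Simplified model of failing in-progress shard snapshots when their node leaves the cluster.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class RemovedNodeHandling {

    enum State { INIT, STARTED, ABORTED, FAILED, SUCCESS }

    record ShardEntry(String nodeId, State state) {}

    // On a node-left event, fail every in-progress shard snapshot assigned to that node.
    static Map<String, ShardEntry> processSnapshotsOnRemovedNodes(Map<String, ShardEntry> shards,
                                                                  Set<String> removedNodes) {
        Map<String, ShardEntry> updated = new HashMap<>();
        shards.forEach((shardId, entry) -> {
            boolean inProgress = entry.state() == State.INIT
                || entry.state() == State.STARTED
                || entry.state() == State.ABORTED;
            if (inProgress && removedNodes.contains(entry.nodeId())) {
                updated.put(shardId, new ShardEntry(entry.nodeId(), State.FAILED));
            } else {
                updated.put(shardId, entry);
            }
        });
        return updated;
    }

    public static void main(String[] args) {
        Map<String, ShardEntry> shards = Map.of(
            "[index][0]", new ShardEntry("node-1", State.STARTED),
            "[index][1]", new ShardEntry("node-2", State.SUCCESS));
        System.out.println(processSnapshotsOnRemovedNodes(shards, Set.of("node-1")));
    }
}
```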
Thanks for looking into the change. In addition to the IndexNotFoundException, I have also found that we are not updating IndexShardSnapshotStatus in cases like:
i) the shard is not primary on the current node,
ii) the shard is relocating,
iii) the shard is recovering, and
iv) the index hasn't been loaded yet in the indices service.

I have fixed it as part of this PR. I have also added the missing ABORTED status in SnapshotIndexShardStage, SnapshotIndexShardStatus and SnapshotShardsStats. Should I raise a separate PR for them?
For people struggling with this issue, should we consider temporarily disabling shard allocation before snapshots and re-enabling it afterwards?
@BobBlank12 that will unfortunately not help. I've worked on a proper fix (which requires rewriting some core parts of the snapshotting code), but have to break this up now into smaller reviewable pieces. There is unfortunately no workaround for now. If you hit this issue, you can manually solve the problem by following the procedure outlined here: #31624 (comment)
* Bringing in cluster state test infrastructure
* Relates elastic#32265
* Use `DeterministicTaskQueue` infrastructure to reproduce elastic#32265
We have a quick fix in #36113 and a more comprehensive fix will follow.
@backslasht can you verify that #36113 fixes the issue?
* Fixes two broken spots:
1. Master failover while deleting a snapshot that has no shards will get stuck if the new master finds the 0-shard snapshot in `INIT` when deleting
2. Aborted shards that were never seen in `INIT` state by the `SnapshotsShardService` will not be notified as failed, leading to the snapshot staying in `ABORTED` state and never getting deleted with one or more shards stuck in `ABORTED` state
* Tried to make fixes as short as possible so we can backport to `6.x` with the least amount of risk
* Significantly extended test infrastructure to reproduce the above two issues
* Two new test runs:
1. Reproducing the effects of node disconnects/restarts in isolation
2. Reproducing the effects of disconnects/restarts in parallel with shard relocations and deletes
* Relates #32265
* Closes #32348
Background
We have identified an issue in the latest Elasticsearch snapshot code where a snapshot gets stuck (making no progress and unable to be deleted) when one or more shards whose snapshot state is INIT or STARTED are not being worked on by the node they are assigned to. This can happen when the current primary node is different (it changed after the snapshot started) from the node (the old primary) on which the shard was marked to be snapshotted.
When does it happen
This happens when one of the data nodes holding primary shards is restarted (process restart) while the snapshot is running and rejoins the cluster within 30 seconds. Upon restart, the node fails to process the cluster state update (due to a race condition between the snapshot service and the indices service), so all the shards for which the node was primary before the restart and whose snapshot state is INIT or STARTED will be stuck in that state forever.
The shards get stuck because the indices service throws an IndexNotFoundException, as it hasn't processed the cluster state update yet. If one of the shards (say shard x out of the y shards that need to be snapshotted) hits the IndexNotFoundException, SnapshotShardsService fails to queue the snapshot thread for that shard as well as for all the following shards, i.e. (y - x) + 1 shards in total. The master keeps waiting for each of these shards to reach a terminal state (DONE or FAILED) and report back, but since the snapshot thread never started for them, they never report their state and the snapshot is stuck indefinitely.
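The failure mode can be illustrated with a small, self-contained sketch (hypothetical names, not the actual SnapshotShardsService code): the exception thrown while queueing shard x escapes the per-shard loop, so shards x through y are never queued and never report back.

```java
// Simplified illustration of the stall: one IndexNotFoundException aborts the whole loop.
import java.util.List;

public class StalledSnapshotLoop {

    static class IndexNotFoundException extends RuntimeException {
        IndexNotFoundException(String index) { super("no such index [" + index + "]"); }
    }

    // Stand-in for looking up the shard in the indices service; fails for one shard to
    // simulate the race with the not-yet-applied cluster state update.
    static void queueShardSnapshot(String shardId) {
        if (shardId.equals("[logs][2]")) {
            throw new IndexNotFoundException("logs");
        }
        System.out.println("queued snapshot task for " + shardId);
    }

    public static void main(String[] args) {
        List<String> shards = List.of("[logs][0]", "[logs][1]", "[logs][2]", "[logs][3]", "[logs][4]");
        // The unguarded loop: the exception from [logs][2] aborts the iteration, so
        // [logs][3] and [logs][4] are never queued either, and the master waits forever.
        for (String shard : shards) {
            queueShardSnapshot(shard);
        }
    }
}
```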
When a delete call is invoked on the stuck snapshot, all the shards that are in INIT or STARTED state are marked as ABORTED, expecting the BlobStoreRepository to throw an exception and move the shard to the terminal state FAILED. But since no thread is working on these shards, they remain ABORTED, and each subsequent delete call is queued on top, resulting in an ever-growing number of pending tasks.
Proposed Fix
When an IndexNotFoundException is received by the SnapshotShardsService, catch the exception and immediately mark the snapshot shard state as FAILED. After marking it as FAILED, the service can still go ahead and process the rest of the shards, which eventually makes the snapshot end up in the PARTIAL state instead of being stuck forever.
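A minimal sketch of this handling, assuming (as an illustration, with simplified stand-in names) that marking a shard's snapshot status as FAILED lets the master finish the snapshot as PARTIAL:

```java
// Simplified model of the proposed fix: catch the exception per shard and keep going.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GuardedSnapshotLoop {

    static class IndexNotFoundException extends RuntimeException {
        IndexNotFoundException(String index) { super("no such index [" + index + "]"); }
    }

    static void queueShardSnapshot(String shardId) {
        if (shardId.equals("[logs][2]")) {
            throw new IndexNotFoundException("logs");   // simulates the race described above
        }
    }

    public static void main(String[] args) {
        List<String> shards = List.of("[logs][0]", "[logs][1]", "[logs][2]", "[logs][3]");
        Map<String, String> shardStates = new LinkedHashMap<>();
        for (String shard : shards) {
            try {
                queueShardSnapshot(shard);
                shardStates.put(shard, "STARTED");
            } catch (IndexNotFoundException e) {
                // Mark the shard FAILED immediately and continue; the remaining shards are
                // still queued, so the snapshot can complete as PARTIAL instead of hanging.
                shardStates.put(shard, "FAILED: " + e.getMessage());
            }
        }
        System.out.println(shardStates);
    }
}
```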
When a snapshot delete call is made for a snapshot that is in progress (or stuck), SnapshotShardsService iterates through the shards of the snapshot and marks them as DONE or FAILED if they have completed or errored, respectively. Shards that are in INIT or STARTED state are marked as ABORTED, expecting the thread that is uploading the data to detect the ABORTED state and throw an error, thus reaching the terminal FAILED state. But since there are no threads working on these shards (they were lost during the restart), the state remains the same forever. To fix that, while marking the shard status as ABORTED we can do an additional check to see whether the shard's current primary is different from the node to which the shard was assigned for snapshotting. If they are different, we can fail the shard immediately, making all shards reach a terminal state (either DONE or FAILED), which in turn allows the problematic snapshot to be deleted successfully.
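An illustrative sketch of that additional check (hypothetical names and a simplified state model, not the actual implementation):

```java
// Simplified model: abort a shard only if its assigned node is still the primary;
// otherwise fail it immediately so the delete can complete.
import java.util.Map;

public class AbortOrFail {

    enum State { INIT, STARTED, ABORTED, FAILED, DONE }

    // Decide the new state for an in-progress shard snapshot when a delete is requested.
    static State abortOrFail(String assignedNodeId, String currentPrimaryNodeId, State current) {
        if (current != State.INIT && current != State.STARTED) {
            return current;                        // already terminal or already aborted
        }
        if (!assignedNodeId.equals(currentPrimaryNodeId)) {
            // No thread on the assigned node can ever observe ABORTED, so fail right away.
            return State.FAILED;
        }
        return State.ABORTED;                      // the running snapshot thread will see this
    }

    public static void main(String[] args) {
        // Shard 0: primary moved after the restart -> FAILED; shard 1: primary unchanged -> ABORTED.
        System.out.println(Map.of(
            "[logs][0]", abortOrFail("node-1", "node-2", State.STARTED),
            "[logs][1]", abortOrFail("node-3", "node-3", State.STARTED)));
    }
}
```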
Additionally, add the ABORTED state to SnapshotIndexShardStage, SnapshotIndexShardStatus and SnapshotShardsStats, where it is currently missing.
Steps to reproduce (100% success)
Related Issues