Update IndexShardSnapshotStatus when an exception is encountered #32265
Conversation
…e snapshot call is made

When a snapshot is about to start for a shard that is in one of the following states, we currently don't move the snapshot status to FAILED; this change adds code to do so:
i) Not primary on the current node
ii) Relocating
iii) Recovering
iv) Index hasn't been loaded yet in the indices service

Also adds the ABORTED status to the IndexShard Status/Stats/Stage.
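To make the intended behavior concrete, here is a minimal, self-contained Java sketch of the check described above. The enum values and the `moveToFailed` helper are illustrative stand-ins, not the actual Elasticsearch classes:

```java
// Simplified model of the pre-snapshot check; names are hypothetical, not the real ES internals.
import java.util.EnumSet;
import java.util.Set;

public class ShardSnapshotGuard {

    // States (from the description above) in which a shard cannot be snapshotted.
    enum ShardCondition { STARTED_PRIMARY, NOT_PRIMARY, RELOCATING, RECOVERING, INDEX_NOT_LOADED }

    // Hypothetical status holder standing in for IndexShardSnapshotStatus.
    static final class ShardSnapshotStatus {
        String stage = "INIT";
        String failureReason;

        void moveToFailed(String reason) {   // assumed method name, for illustration only
            this.stage = "FAILURE";
            this.failureReason = reason;
        }
    }

    private static final Set<ShardCondition> UNSNAPSHOTTABLE =
        EnumSet.of(ShardCondition.NOT_PRIMARY, ShardCondition.RELOCATING,
                   ShardCondition.RECOVERING, ShardCondition.INDEX_NOT_LOADED);

    // Before starting the shard snapshot, move the status to FAILED instead of leaving it hanging.
    static boolean startOrFail(ShardCondition condition, ShardSnapshotStatus status) {
        if (UNSNAPSHOTTABLE.contains(condition)) {
            status.moveToFailed("shard cannot be snapshotted: " + condition);
            return false;       // caller reports the failure back to the master
        }
        return true;            // safe to proceed with the actual snapshot
    }

    public static void main(String[] args) {
        ShardSnapshotStatus status = new ShardSnapshotStatus();
        startOrFail(ShardCondition.RELOCATING, status);
        System.out.println(status.stage + " / " + status.failureReason);
    }
}
```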
Pinging @elastic/es-distributed
Thanks for reporting this. I've looked more closely into the series of events leading to this situation: When a node leaves the cluster, the master will fail all shards allocated to that node. SnapshotsService on the master (which is a ClusterStateApplier) will get the updated cluster state with the removed node and call processSnapshotsOnRemovedNodes, which in turn will submit a cluster state update task to move the snapshot from STARTED/ABORTED to FAILED.

The first thing we'll need to do here is to write an integration test that reproduces the issue. Regarding a fix, I would prefer to have a SnapshotsInProgress object that's fully in sync with the routing table, similar to what I have done here for the RestoreInProgress information, and then build a solution on top of that. I'll explore this further in the next days, just wanted to give you an update here.
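For illustration, here is a simplified, self-contained model of the transition described above (in-progress shard snapshots assigned to a departed node move to FAILED). The names are hypothetical stand-ins, not the real SnapshotsService API:

```java
// Simplified model of failing in-progress shard snapshots when their node leaves the cluster.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class RemovedNodeHandling {

    enum State { INIT, STARTED, ABORTED, FAILED, SUCCESS }

    record ShardEntry(String nodeId, State state) {}

    // On a node-left event, fail every in-progress shard snapshot assigned to that node.
    static Map<String, ShardEntry> processSnapshotsOnRemovedNodes(Map<String, ShardEntry> shards,
                                                                  Set<String> removedNodes) {
        Map<String, ShardEntry> updated = new HashMap<>();
        shards.forEach((shardId, entry) -> {
            boolean inProgress = entry.state() == State.INIT
                || entry.state() == State.STARTED
                || entry.state() == State.ABORTED;
            if (inProgress && removedNodes.contains(entry.nodeId())) {
                updated.put(shardId, new ShardEntry(entry.nodeId(), State.FAILED));
            } else {
                updated.put(shardId, entry);
            }
        });
        return updated;
    }

    public static void main(String[] args) {
        Map<String, ShardEntry> shards = Map.of(
            "[index][0]", new ShardEntry("node-1", State.STARTED),
            "[index][1]", new ShardEntry("node-2", State.SUCCESS));
        System.out.println(processSnapshotsOnRemovedNodes(shards, Set.of("node-1")));
    }
}
```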
Thanks for looking into the change. In addition to the IndexNotFoundException, I have also found that we are not updating IndexShardSnapshotStatus in cases like:
i) the shard is not primary on the current node,
ii) the shard is relocating,
iii) the shard is recovering, and
iv) the index hasn't been loaded yet in the indices service.

I have fixed it as part of this PR. I have also added the missing ABORTED status in SnapshotIndexShardStage, SnapshotIndexShardStatus and SnapshotShardsStats. Should I raise a separate PR for them?
For people struggling with this issue, should we consider temporarily disabling shard allocation before snapshots and re-enabling it afterwards?
@BobBlank12 that will unfortunately not help. I've worked on a proper fix (which requires rewriting some core parts of the snapshotting code), but have to break this up now into smaller reviewable pieces. There is unfortunately no workaround for now. If you hit this issue, you can manually solve the problem by following the procedure outlined here: #31624 (comment)
* Bringing in cluster state test infrastructure
* Relates elastic#32265
* Use `DeterministicTaskQueue` infrastructure to reproduce elastic#32265
We have a quick fix in #36113 and a more comprehensive fix will follow.
@backslasht can you verify that #36113 fixes the issue?
* Fixes two broken spots:
1. Master failover while deleting a snapshot that has no shards will get stuck if the new master finds the 0-shard snapshot in `INIT` when deleting
2. Aborted shards that were never seen in `INIT` state by the `SnapshotsShardService` will not be notified as failed, leading to the snapshot staying in `ABORTED` state and never getting deleted with one or more shards stuck in `ABORTED` state
* Tried to make fixes as short as possible so we can backport to `6.x` with the least amount of risk
* Significantly extended test infrastructure to reproduce the above two issues
* Two new test runs:
1. Reproducing the effects of node disconnects/restarts in isolation
2. Reproducing the effects of disconnects/restarts in parallel with shard relocations and deletes
* Relates #32265
* Closes #32348
Background
We have identified an issue in the latest Elasticsearch snapshot code where a snapshot gets stuck (making no progress and unable to be deleted) when one or more shards whose snapshot state is INIT or STARTED are not being worked on by the node they are assigned to. This can happen when the current primary node is different (it changed after the snapshot started) from the node (the old primary) on which the shard was marked to be snapshotted.
When does it happen
This happens when one of the data nodes holding primary shards is restarted (process restart) while the snapshot is running and rejoins the cluster within 30 seconds. Upon restart, the node fails to process the cluster state update (due to a race condition between the snapshot service and the indices service), so all the shards for which the node was primary before the restart and whose snapshot state is INIT or STARTED will be stuck in that state forever.
The shards get stuck because the indices service throws an IndexNotFoundException, as it hasn't processed the cluster state update yet. If one of the shards (say shard x out of the y shards that need to be snapshotted) hits the IndexNotFoundException, SnapshotShardsService fails to queue the snapshot thread for that shard as well as for all the following shards, i.e. (y - x) + 1 shards in total. The master keeps waiting for each of these shards to reach a terminal state (DONE or FAILED) and report back, but since the snapshot thread never started for them, they never report their state and the snapshot is stuck indefinitely.
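The failure mode can be illustrated with a small, self-contained sketch (hypothetical names, not the actual SnapshotShardsService code): the exception thrown while queueing shard x escapes the per-shard loop, so shards x through y are never queued and never report back.

```java
// Simplified illustration of the stall: one IndexNotFoundException aborts the whole loop.
import java.util.List;

public class StalledSnapshotLoop {

    static class IndexNotFoundException extends RuntimeException {
        IndexNotFoundException(String index) { super("no such index [" + index + "]"); }
    }

    // Stand-in for looking up the shard in the indices service; fails for one shard to
    // simulate the race with the not-yet-applied cluster state update.
    static void queueShardSnapshot(String shardId) {
        if (shardId.equals("[logs][2]")) {
            throw new IndexNotFoundException("logs");
        }
        System.out.println("queued snapshot task for " + shardId);
    }

    public static void main(String[] args) {
        List<String> shards = List.of("[logs][0]", "[logs][1]", "[logs][2]", "[logs][3]", "[logs][4]");
        // The unguarded loop: the exception from [logs][2] aborts the iteration, so
        // [logs][3] and [logs][4] are never queued either, and the master waits forever.
        for (String shard : shards) {
            queueShardSnapshot(shard);
        }
    }
}
```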
When a delete call is invoked on the stuck snapshot, all the shards that are in INIT or STARTED state are marked as ABORTED, expecting the BlobStoreRepository to throw an exception and move the shard to the terminal state FAILED. But since no thread is working on these shards, they remain ABORTED, and each subsequent delete call is queued on top, resulting in an ever-growing number of pending tasks.
Proposed Fix
When an IndexNotFoundException is received by the SnapshotShardsService, catch the exception and immediately mark the snapshot shard state as FAILED. After marking it as FAILED, the service can still go ahead and process the rest of the shards, which eventually makes the snapshot end up in the PARTIAL state instead of being stuck forever.
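A minimal sketch of this handling, assuming (as an illustration, with simplified stand-in names) that marking a shard's snapshot status as FAILED lets the master finish the snapshot as PARTIAL:

```java
// Simplified model of the proposed fix: catch the exception per shard and keep going.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GuardedSnapshotLoop {

    static class IndexNotFoundException extends RuntimeException {
        IndexNotFoundException(String index) { super("no such index [" + index + "]"); }
    }

    static void queueShardSnapshot(String shardId) {
        if (shardId.equals("[logs][2]")) {
            throw new IndexNotFoundException("logs");   // simulates the race described above
        }
    }

    public static void main(String[] args) {
        List<String> shards = List.of("[logs][0]", "[logs][1]", "[logs][2]", "[logs][3]");
        Map<String, String> shardStates = new LinkedHashMap<>();
        for (String shard : shards) {
            try {
                queueShardSnapshot(shard);
                shardStates.put(shard, "STARTED");
            } catch (IndexNotFoundException e) {
                // Mark the shard FAILED immediately and continue; the remaining shards are
                // still queued, so the snapshot can complete as PARTIAL instead of hanging.
                shardStates.put(shard, "FAILED: " + e.getMessage());
            }
        }
        System.out.println(shardStates);
    }
}
```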
When a snapshot delete call is made for a snapshot that is in progress (or stuck), SnapshotShardsService iterates through the shards of the snapshot and marks them as DONE or FAILED if they have completed or errored, respectively. Shards that are in INIT or STARTED state are marked as ABORTED, expecting the thread that is uploading the data to detect the ABORTED state and throw an error, thus reaching the terminal FAILED state. But since there are no threads working on these shards (they were lost during the restart), the state remains the same forever. To fix that, while marking the shard status as ABORTED we can do an additional check to see whether the shard's current primary is different from the node to which the shard was assigned for snapshotting. If they are different, we can fail the shard immediately, making all shards reach a terminal state (either DONE or FAILED), which in turn allows the problematic snapshot to be deleted successfully.
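An illustrative sketch of that additional check (hypothetical names and a simplified state model, not the actual implementation):

```java
// Simplified model: abort a shard only if its assigned node is still the primary;
// otherwise fail it immediately so the delete can complete.
import java.util.Map;

public class AbortOrFail {

    enum State { INIT, STARTED, ABORTED, FAILED, DONE }

    // Decide the new state for an in-progress shard snapshot when a delete is requested.
    static State abortOrFail(String assignedNodeId, String currentPrimaryNodeId, State current) {
        if (current != State.INIT && current != State.STARTED) {
            return current;                        // already terminal or already aborted
        }
        if (!assignedNodeId.equals(currentPrimaryNodeId)) {
            // No thread on the assigned node can ever observe ABORTED, so fail right away.
            return State.FAILED;
        }
        return State.ABORTED;                      // the running snapshot thread will see this
    }

    public static void main(String[] args) {
        // Shard 0: primary moved after the restart -> FAILED; shard 1: primary unchanged -> ABORTED.
        System.out.println(Map.of(
            "[logs][0]", abortOrFail("node-1", "node-2", State.STARTED),
            "[logs][1]", abortOrFail("node-3", "node-3", State.STARTED)));
    }
}
```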
Additionally, add the ABORTED state to SnapshotIndexShardStage, SnapshotIndexShardStatus and SnapshotShardsStats, where it is currently missing.
Steps to reproduce (100% success)
Related Issues