Skip to content

Conversation

@fcofdez
Copy link
Contributor

@fcofdez fcofdez commented Oct 19, 2021

If we don't cancel the re-location of the index to the same target
node, it is possible that the recovery is retried, meaning that it's
possible that the available permit is granted to indexRecoveredFromSnapshot1
instead of to indexRecoveredFromSnapshot2.

Relates #79316
Closes #79420

If we don't cancel the re-location of the index to the same target
node, it is possible that the recovery is retried, causing a race
condition.
@fcofdez fcofdez added >test Issues or PRs that are addressing/adding tests :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v8.0.0 Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v7.16.0 labels Oct 19, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)


targetMockTransportService.clearAllRules();
channelRef.get().sendResponse(new IOException("unable to clean files"));
assertAcked(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to find a way to ensure that the RecoveryTarget reference is released and therefore the snapshot file download permit is released but since it happens asynchronously I couldn't find a reliable way to be sure that the permit has been released. Maybe we should add a Thread.sleep here? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we set index.allocation.max_retries: 1 rather than adding this filter? That way we can be sure that it's the failure that releases the permits and not the fact that the allocation filter causes allocation to be cancelled.

In terms of waiting for the permits to be released, maybe add a package-private method that exposes the RecoverySettings on the PeerRecoveryTargetService and then after updating the allocation filter you can assertBusy that all the permits can be acquired.

@fcofdez fcofdez requested a review from DaveCTurner October 19, 2021 06:04
@danhermann danhermann added v8.1.0 and removed v7.16.0 labels Oct 27, 2021
@fcofdez
Copy link
Contributor Author

fcofdez commented Oct 27, 2021

@DaveCTurner would you mind taking a look into this when you have the chance? thanks!

@fcofdez
Copy link
Contributor Author

fcofdez commented Oct 28, 2021

@elasticmachine update branch

@fcofdez
Copy link
Contributor Author

fcofdez commented Oct 28, 2021

@elasticmachine run elasticsearch-ci/part-2
Unrelated failure

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (with a couple of nits)

Comment on lines 977 to 979
Releasable snapshotDownloadPermit = peerRecoveryTargetService.tryAcquireSnapshotDownloadPermits();
assertThat(snapshotDownloadPermit, is(notNullValue()));
snapshotDownloadPermit.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight preference for using a try-with-resources here.

.put(IndexService.GLOBAL_CHECKPOINT_SYNC_INTERVAL_SETTING.getKey(), "1s")
.put("index.routing.allocation.require._name", dataNodes.get(0))
.put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0)
.put("index.allocation.max_retries", 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight preference for using the setting directly rather than its literal name:

Suggested change
.put("index.allocation.max_retries", 0)
.put(SETTING_ALLOCATION_MAX_RETRY.getKey(), 0)

@fcofdez fcofdez merged commit 5cb5d92 into elastic:master Nov 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests v8.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] SnapshotBasedIndexRecoveryIT testRecoveryUsingSnapshotsPermitIsReturnedAfterFailureOrCancellation failing

4 participants