
Conversation

@original-brownbear (Contributor)

Deduplicate shard started requests the same way we deduplicate shard-failed
and shard snapshot state updates already.

closes #81628
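
For readers unfamiliar with the mechanism, here is a minimal, self-contained sketch of the idea (a simplified stand-in written for this summary, not the real Elasticsearch ResultDeduplicator class): a request that is already in flight to the master is not sent again; later callers merely attach a listener and share the single response, and clear() forgets the in-flight state.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

/**
 * Simplified illustration of the deduplication discussed in this PR, not the real
 * Elasticsearch ResultDeduplicator: concurrent callers that want to send the same
 * request share a single in-flight send, and all of their listeners are completed
 * with that one response.
 */
final class SimpleResultDeduplicator<T, R> {

    private final Map<T, List<Consumer<R>>> inFlight = new HashMap<>();

    /** Runs {@code send} once per distinct in-flight request; later callers only register a listener. */
    void executeOnce(T request, Consumer<R> listener, BiConsumer<T, Consumer<R>> send) {
        final List<Consumer<R>> listeners;
        final boolean firstCaller;
        synchronized (inFlight) {
            List<Consumer<R>> existing = inFlight.get(request);
            if (existing == null) {
                listeners = new ArrayList<>();
                inFlight.put(request, listeners);
                firstCaller = true;
            } else {
                listeners = existing;
                firstCaller = false;
            }
            listeners.add(listener);
        }
        if (firstCaller) {
            send.accept(request, response -> {
                final List<Consumer<R>> toNotify;
                synchronized (inFlight) {
                    inFlight.remove(request, listeners);
                    toNotify = List.copyOf(listeners);
                }
                // complete every caller that piggy-backed on this one send
                toNotify.forEach(l -> l.accept(response));
            });
        }
    }

    /** Forgets all in-flight requests, e.g. after a master failover (best effort). */
    void clear() {
        synchronized (inFlight) {
            inFlight.clear();
        }
    }
}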

@original-brownbear added the >bug, :Distributed Indexing/Recovery, v8.0.0 and v8.1.0 labels on Dec 27, 2021
@elasticmachine added the Team:Distributed label on Dec 27, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@henningandersen (Contributor) left a comment

Can we add a test demonstrating that this works, both when master has stuff queued, causing the retries, and when master fails over?

I wonder if we should change our logic here to always send to a new master to speed up recovery after a very slow/GC-hung master is taken over by another master?

     // a list of shards that failed during replication
     // we keep track of these shards in order to avoid sending duplicate failed shard requests for a single failing shard.
-    private final ResultDeduplicator<FailedShardEntry, Void> remoteFailedShardsDeduplicator = new ResultDeduplicator<>();
+    private final ResultDeduplicator<TransportRequest, Void> remoteFailedShardsDeduplicator = new ResultDeduplicator<>();
@henningandersen (Contributor)

I think this field needs a rename now.

@original-brownbear (author)

++ renamed and fixed comment

@original-brownbear (author)

> Can we add a test demonstrating that this works, both when master has stuff queued, causing the retries, and when master fails over?

I added a rather trivial test in the style of the tests that already exist for this thing (the test is good enough to demonstrate proper deduplication of requests IMO). Couldn't find a quick way of testing the thing below.

> I wonder if we should change our logic here to always send to a new master to speed up recovery after a very slow/GC-hung master is taken over by another master?

Yeah, we had the same issue in shard snapshot state updates, and I implemented the same solution now. Unfortunately I couldn't find a neat way of testing this quickly; in snapshotting this is a lot easier to test with the existing test infrastructure.
Not sure it's worth the effort to add a test for this today, since clearing out the deduplicator already makes this certainly better than it was before?
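
A minimal sketch of the "send straight to the new master" wiring described above, reusing the SimpleResultDeduplicator illustration from earlier; the onClusterState hook and node-id parameter are hypothetical stand-ins for Elasticsearch's real cluster-state listener API:

import java.util.Objects;

// Illustration only: the surrounding code is assumed to call onClusterState with the
// master node id of every applied cluster state (a stand-in for the real listener API).
final class MasterFailoverWatcher {

    private final SimpleResultDeduplicator<?, ?> deduplicator;
    private volatile String lastMasterNodeId;

    MasterFailoverWatcher(SimpleResultDeduplicator<?, ?> deduplicator) {
        this.deduplicator = deduplicator;
    }

    void onClusterState(String currentMasterNodeId) {
        if (Objects.equals(lastMasterNodeId, currentMasterNodeId) == false) {
            lastMasterNodeId = currentMasterNodeId;
            // Forget in-flight sends so the next retry goes straight to the new master.
            // Best effort: a racing send may still be deduplicated against an old entry,
            // but at worst the master receives one extra, harmless request.
            deduplicator.clear();
        }
    }
}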

@henningandersen (Contributor) left a comment

LGTM, with two small additions.

     * to the new master right away on master failover.
     */
    public void clearRemoteShardRequestDeduplicator() {
        remoteShardStateUpdateDeduplicator.clear();
@henningandersen (Contributor)

Since we call this from multiple threads, this method is a bit best-effort; I think that is worth documenting.

For instance, in edge cases this may clear out the deduplication entry for a remote shard-failed request that was already sent to the new master. This does no real harm, since we still protect the master.
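
A sketch of what such a note could look like, using the method and field names from the diff above; this is assumed wording, not the Javadoc that was actually merged:

    /**
     * Clears the deduplicator so that pending shard-started / shard-failed requests are
     * sent again on their next retry, e.g. straight to a newly elected master.
     *
     * This is best effort because it can race with sends from other threads: in rare
     * cases the entry for a request that was just dispatched to the new master is
     * cleared as well, so that request may be sent one extra time on retry. This does
     * no real harm, since the master is still protected against duplicates.
     */
    public void clearRemoteShardRequestDeduplicator() {
        remoteShardStateUpdateDeduplicator.clear();
    }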

@original-brownbear (author)

++ added

        assertThat(transport.capturedRequests(), arrayWithSize(0));
    }

    public void testDeduplicateRemoteShardStarted() throws InterruptedException {
@henningandersen (Contributor)

Can we either add a test or randomly clear the deduplicator here and then validate we see two requests at the end?

@original-brownbear (author)

++ done
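
For readers following along, a self-contained sketch of the behaviour that check verifies, using the SimpleResultDeduplicator illustration from above rather than the actual Elasticsearch test infrastructure shown in the diff; the class name, request key and counter are made up for the example:

import java.util.concurrent.atomic.AtomicInteger;

public class DeduplicatorClearCheck {
    public static void main(String[] args) {
        SimpleResultDeduplicator<String, Void> dedup = new SimpleResultDeduplicator<>();
        AtomicInteger sentToMaster = new AtomicInteger();

        // The send callback only counts outbound requests and never responds,
        // simulating a master that has not acked the shard-started request yet.
        dedup.executeOnce("shard-started[shard-0]", r -> {}, (req, responder) -> sentToMaster.incrementAndGet());
        dedup.executeOnce("shard-started[shard-0]", r -> {}, (req, responder) -> sentToMaster.incrementAndGet());
        if (sentToMaster.get() != 1) {
            throw new AssertionError("expected the duplicate to be deduplicated");
        }

        dedup.clear(); // e.g. a master failover happened in between
        dedup.executeOnce("shard-started[shard-0]", r -> {}, (req, responder) -> sentToMaster.incrementAndGet());
        if (sentToMaster.get() != 2) {
            throw new AssertionError("expected the request to be sent again after clear()");
        }
    }
}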

@original-brownbear (author)

Thanks Henning!

@original-brownbear merged commit 01debdc into elastic:master on Dec 28, 2021
@original-brownbear deleted the 81628 branch on December 28, 2021 18:02
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 28, 2021
Deduplicate shard started requests the same way we deduplicate shard-failed
and shard snapshot state updates already.

closes elastic#81628
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.0

elasticsearchmachine pushed a commit that referenced this pull request Dec 28, 2021
Deduplicate shard started requests the same way we deduplicate shard-failed
and shard snapshot state updates already.

closes #81628
@DaveCTurner (Contributor)

LGTM as a small/interim fix, but fundamentally we should be using edge-triggering for the shard state transitions with appropriate failure handling to organise retries (see also #81626). Today’s level-triggered system was necessary when cluster state updates could be lost I guess, but that’s no longer the case. I opened #82185 to track this tech debt.


Linked issue: Stop unnecessary retries of shard-started tasks (#81628)
