
Conversation

@dnhatn
Member

@dnhatn dnhatn commented Sep 24, 2019

Today, we don't clear the shard info of the primary shard when a new node joins, so we risk making replica allocation decisions based on stale information from the primary. The serious problem is that, because of the old info we have from the primary, we can cancel an ongoing recovery that is already more advanced than the copy on the newly joined node.

With this change, we ensure that the shard info from the primary is not older than any node in the cluster when allocating replicas.

Relates #46959

This work was done by @henningandersen in #42518.
Co-authored-by: Henning Andersen [email protected]
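
For readers unfamiliar with the allocation code, here is a minimal, self-contained Java sketch of the invariant this change enforces. It is not the actual Elasticsearch implementation, and all names in it (`PrimaryShardInfoCache`, `getFreshInfo`, `fetchFromPrimary`) are hypothetical: cached shard info fetched from the primary may only be used for replica allocation if it is not older than any data node currently known to the cluster; otherwise it is cleared and refetched.

```java
import java.util.Set;
import java.util.function.Function;

/**
 * Illustrative sketch only. It models the rule described above: cached shard info
 * fetched from the primary may be used for replica allocation only if no data node
 * has joined since the info was fetched; otherwise the cache is cleared and refilled.
 */
final class PrimaryShardInfoCache<T> {

    private Set<String> nodesAtFetchTime = Set.of(); // data nodes known when the info was fetched
    private T cachedPrimaryInfo = null;              // null means nothing fetched yet

    /**
     * Returns primary shard info that is at least as fresh as the given node set,
     * refetching if any current node joined after the last fetch.
     */
    synchronized T getFreshInfo(Set<String> currentDataNodes, Function<Set<String>, T> fetchFromPrimary) {
        boolean staleForSomeNode = nodesAtFetchTime.containsAll(currentDataNodes) == false;
        if (cachedPrimaryInfo == null || staleForSomeNode) {
            // A node joined after the last fetch: refetch so we never cancel an ongoing
            // recovery based on outdated knowledge of the primary's copy.
            cachedPrimaryInfo = fetchFromPrimary.apply(currentDataNodes);
            nodesAtFetchTime = Set.copyOf(currentDataNodes);
        }
        return cachedPrimaryInfo;
    }
}
```

The sketch only captures the invalidation rule; in the real code the fetching and caching live in the cluster-level shard allocation machinery.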

@dnhatn dnhatn added >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 v7.5.0 labels Sep 24, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Member Author

dnhatn commented Sep 24, 2019

Failure at [reference/cluster/health:36]: $body didn't match expected value:

This was fixed in #47016.

Contributor

@henningandersen henningandersen left a comment

LGTM, but I am not sure my review counts on this one 😃...

Contributor

@original-brownbear original-brownbear left a comment

Just some drive-by comments

Contributor

@original-brownbear original-brownbear left a comment

LGTM :) Thanks Nhat!
It might be best for David or Yannick to look over this as well, though. I understand what's going on here just fine now, but I might miss some implication of this change.

@dnhatn
Member Author

dnhatn commented Sep 27, 2019

@DaveCTurner @ywelsch This PR blocks the work in #46959. It would be great if one of you could take a look. Thank you!

Contributor

@DaveCTurner DaveCTurner left a comment

Looks good, I suggested a comment and a few small changes.

@dnhatn dnhatn requested a review from DaveCTurner September 27, 2019 16:48
Contributor

@DaveCTurner DaveCTurner left a comment

LGTM thanks @dnhatn

@dnhatn
Member Author

dnhatn commented Sep 27, 2019

Test failures are fixed in #47196.

@dnhatn
Member Author

dnhatn commented Sep 28, 2019

@henningandersen @original-brownbear @DaveCTurner Thank you for reviewing.

@dnhatn dnhatn merged commit caaf02f into elastic:master Sep 28, 2019
@dnhatn dnhatn deleted the refetch-node-join branch September 28, 2019 02:21
dnhatn added a commit that referenced this pull request Oct 2, 2019
Today, we don't clear the shard info of the primary shard when a new
node joins, so we risk making replica allocation decisions based on
stale information from the primary. The serious problem is that,
because of the old info we have from the primary, we can cancel an
ongoing recovery that is already more advanced than the copy on the
newly joined node.

With this change, we ensure that the shard info from the primary is not
older than any node in the cluster when allocating replicas.

Relates #46959

This work was done by Henning in #42518.

Co-authored-by: Henning Andersen <[email protected]>