
Conversation

@dnhatn
Member

@dnhatn dnhatn commented Sep 24, 2019

Today, we don't clear the shard info of the primary shard when a new node joins, so we risk making replica allocation decisions based on stale information from the primary. The serious problem is that, because of the old info we have from the primary, we can cancel an ongoing recovery that is already more advanced than the copy on the newly joined node.

With this change, we ensure that the shard info from the primary is not older than any node in the cluster when allocating replicas.

Relates #46959

This work was done by @henningandersen in #42518.
Co-authored-by: Henning Andersen [email protected]
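
For readers unfamiliar with the allocation code, here is a minimal, self-contained Java sketch of the invariant this change enforces. It is not the actual Elasticsearch implementation, and all names in it (`PrimaryShardInfoCache`, `getFreshInfo`, `fetchFromPrimary`) are hypothetical: cached shard info fetched from the primary may only be used for replica allocation if it is not older than any data node currently known to the cluster; otherwise it is cleared and refetched.

```java
import java.util.Set;
import java.util.function.Function;

/**
 * Illustrative sketch only. It models the rule described above: cached shard info
 * fetched from the primary may be used for replica allocation only if no data node
 * has joined since the info was fetched; otherwise the cache is cleared and refilled.
 */
final class PrimaryShardInfoCache<T> {

    private Set<String> nodesAtFetchTime = Set.of(); // data nodes known when the info was fetched
    private T cachedPrimaryInfo = null;              // null means nothing fetched yet

    /**
     * Returns primary shard info that is at least as fresh as the given node set,
     * refetching if any current node joined after the last fetch.
     */
    synchronized T getFreshInfo(Set<String> currentDataNodes, Function<Set<String>, T> fetchFromPrimary) {
        boolean staleForSomeNode = nodesAtFetchTime.containsAll(currentDataNodes) == false;
        if (cachedPrimaryInfo == null || staleForSomeNode) {
            // A node joined after the last fetch: refetch so we never cancel an ongoing
            // recovery based on outdated knowledge of the primary's copy.
            cachedPrimaryInfo = fetchFromPrimary.apply(currentDataNodes);
            nodesAtFetchTime = Set.copyOf(currentDataNodes);
        }
        return cachedPrimaryInfo;
    }
}
```

The sketch only captures the invalidation rule; in the real code the fetching and caching live in the cluster-level shard allocation machinery.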

@dnhatn dnhatn added >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 v7.5.0 labels Sep 24, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Member Author

dnhatn commented Sep 24, 2019

Failure at [reference/cluster/health:36]: $body didn't match expected value:

This was fixed in #47016.

Contributor

@henningandersen henningandersen left a comment

LGTM, but I am not sure my review counts on this one 😃...

Contributor

@original-brownbear original-brownbear left a comment

Just some drive-by comments

Contributor

@original-brownbear original-brownbear left a comment

LGTM :) Thanks Nhat!
It might be best for David or Yannick to look over this as well, though. I understand what's going on here just fine now, but I might miss some implication of this change.

@dnhatn
Member Author

dnhatn commented Sep 27, 2019

@DaveCTurner @ywelsch This PR blocks the work in #46959. It would be great if one of you could take a look. Thank you!

Contributor

@DaveCTurner DaveCTurner left a comment

Looks good, I suggested a comment and a few small changes.

@dnhatn dnhatn requested a review from DaveCTurner September 27, 2019 16:48
Contributor

@DaveCTurner DaveCTurner left a comment

LGTM thanks @dnhatn

@dnhatn
Member Author

dnhatn commented Sep 27, 2019

Test failures are fixed in #47196.

@dnhatn
Member Author

dnhatn commented Sep 28, 2019

@henningandersen @original-brownbear @DaveCTurner Thank you for reviewing.

@dnhatn dnhatn merged commit caaf02f into elastic:master Sep 28, 2019
@dnhatn dnhatn deleted the refetch-node-join branch September 28, 2019 02:21
dnhatn added a commit that referenced this pull request Oct 2, 2019
Today, we don't clear the shard info of the primary shard when a new
node joins, so we risk making replica allocation decisions based on
stale information from the primary. The serious problem is that,
because of the old info we have from the primary, we can cancel an
ongoing recovery that is already more advanced than the copy on the
newly joined node.

With this change, we ensure that the shard info from the primary is not
older than any node in the cluster when allocating replicas.

Relates #46959

This work was done by Henning in #42518.

Co-authored-by: Henning Andersen <[email protected]>