
Conversation

@DaveCTurner (Contributor)

This change ensures that follower nodes periodically check that their leader is
healthy, and that they elect a new leader if not.

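As a rough illustration of the mechanism described above (a minimal sketch only; the class, the probe and election hooks, and the interval and retry count are assumptions, not the PR's actual LeaderChecker), a follower could schedule a recurring health check against its leader and start an election after several consecutive failures:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of follower-side leader fault detection; all names and values are assumptions.
class SimpleLeaderChecker {
    interface LeaderProbe { boolean isLeaderHealthy(); }  // e.g. a lightweight request to the leader
    interface ElectionTrigger { void startElection(); }   // the follower stands up as a candidate

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final int maxFailures = 3;               // assumed retry count
    private final long checkIntervalMillis = 1000L;  // assumed check interval

    ScheduledFuture<?> startLeaderChecker(LeaderProbe probe, ElectionTrigger election) {
        consecutiveFailures.set(0);
        return scheduler.scheduleWithFixedDelay(() -> {
            if (probe.isLeaderHealthy()) {
                consecutiveFailures.set(0);                  // a healthy leader resets the counter
            } else if (consecutiveFailures.incrementAndGet() >= maxFailures) {
                election.startElection();                    // leader considered failed: elect a new one
            }
        }, checkIntervalMillis, checkIntervalMillis, TimeUnit.MILLISECONDS);
    }
}

The sketch only shows the intended control flow: reset the failure counter on a successful check, and treat the leader as failed once enough consecutive checks fail.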
@DaveCTurner added the >enhancement, v7.0.0, and :Distributed Coordination/Cluster Coordination labels on Sep 25, 2018
@DaveCTurner requested a review from ywelsch on Sep 25, 2018 at 13:51
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

switch (getConnectionStatus(getLocalNode(), destination)) {
case BLACK_HOLE:
logger.trace("dropping {}", requestDescription);
if (action.equals(HANDSHAKE_ACTION_NAME)) {

@DaveCTurner (Contributor, Author)

This makes me unhappy, but I was unable to come up with a better idea. Thoughts?

Contributor

I've pushed a7a76c0 which refactors the class a bit to make it more extensible. Let me know if you like that better.
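
For context on the snippet discussed above: it is from a simulated transport used in the coordination tests, where requests to a black-holed destination are dropped and handshake requests get a special case. The following self-contained sketch shows that general pattern; the names, the plain RuntimeException, and the fail-fast handling of handshakes are assumptions for illustration, not necessarily what the actual test class does:

import java.util.function.Consumer;

// Self-contained sketch of the pattern only; names, exception types, and the fail-fast
// treatment of handshakes are assumptions, not the behaviour of the actual test transport.
class DisruptedTransportSketch {
    enum ConnectionStatus { CONNECTED, DISCONNECTED, BLACK_HOLE }

    static final String HANDSHAKE_ACTION = "internal:handshake"; // assumed action name

    void dispatch(ConnectionStatus status, String action, Runnable deliver, Consumer<Exception> fail) {
        switch (status) {
            case BLACK_HOLE:
                // ordinary requests vanish without a response on a black-holed link ...
                if (action.equals(HANDSHAKE_ACTION)) {
                    // ... but in this sketch handshakes fail promptly so the caller does not hang
                    fail.accept(new RuntimeException("handshake to black-holed node"));
                }
                break;
            case DISCONNECTED:
                fail.accept(new RuntimeException("node is disconnected"));
                break;
            case CONNECTED:
                deliver.run();
                break;
        }
    }
}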

@ywelsch mentioned this pull request on Sep 25, 2018
if (leaderCheckScheduler != null) {
leaderCheckScheduler.close();
}
leaderCheckScheduler = leaderChecker.startLeaderChecker(leaderNode);

Contributor

I'm confused. Does this mean we restart the leader checker on every incoming publication? We call becomeFollower on every incoming publication.

@DaveCTurner (Contributor, Author)

It did mean a new leader checker on each publication; I pushed be15266 to stop doing that. However, I can't think of a good way to reliably assert that we don't do this: both ways have pretty much the right liveness properties, and my other two ideas are:

  • check that we don't send another leader check immediately after each publication (not robust);
  • check for reference equality of the leaderCheckScheduler object before/after a second publication (I don't fancy exposing this).

@ywelsch (Contributor), Sep 26, 2018

> However I can't think of a good way to reliably assert that we don't do this

I don't have any good idea here. I'll keep thinking about this. Should not block this PR though.
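
For context, here is a minimal, self-contained sketch of the kind of guard this thread is discussing: recreate the leader checker only when the leader actually changes, so repeated publications from the same leader leave the existing scheduler running. The lastKnownLeader field and the exact condition are assumptions for illustration and are not taken from be15266:

// Self-contained sketch, not the actual Coordinator code or the be15266 change:
// becomeFollower runs on every incoming publication, so only recreate the leader
// checker when the leader itself changes.
class FollowerStateSketch {
    interface Scheduler { void close(); }
    interface LeaderChecker { Scheduler startLeaderChecker(String leaderNodeId); }

    private final LeaderChecker leaderChecker;
    private Scheduler leaderCheckScheduler;
    private String lastKnownLeader; // assumption: remember which leader is already being checked

    FollowerStateSketch(LeaderChecker leaderChecker) {
        this.leaderChecker = leaderChecker;
    }

    void becomeFollower(String leaderNodeId) {
        if (leaderNodeId.equals(lastKnownLeader) == false) {    // first publication, or a new leader
            if (leaderCheckScheduler != null) {
                leaderCheckScheduler.close();                   // stop checking the old leader
            }
            leaderCheckScheduler = leaderChecker.startLeaderChecker(leaderNodeId);
            lastKnownLeader = leaderNodeId;
        }
        // publications from the same leader leave the existing scheduler untouched
    }
}

As noted in the thread above, asserting this in a test remains awkward, because the externally observable behaviour is almost the same whether or not the checker is recreated.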


@DaveCTurner (Contributor, Author)

Thanks for a7a76c0 (for some reason I can't reply to this comment inline). I moved the log statement in 98a8236 but otherwise LGTM.

@ywelsch (Contributor) left a comment

LGTM

@DaveCTurner merged commit d995fc8 into elastic:zen2 on Sep 26, 2018
@DaveCTurner deleted the 2018-09-25-integrate-leader-checker branch on Sep 26, 2018 at 11:18
