@bleskes bleskes commented Jul 25, 2017

This predicate deals with the intricacies of detecting when a master is re-elected or a node rejoins an existing master. The current implementation is based on nodeIds, which is fine if the master really changes. If the nodeId is unchanged, the code falls back to detecting an increment in the cluster state version, which happens when a node is re-elected or when a node rejoins. Sadly, this doesn't cover the case where the same node is elected after a full restart of all master nodes. In that case we recover the cluster state from disk, but the version is reset back to 0. To fix this, the check should be based on ephemeral IDs, which are reset on restart.

Fixes #25471
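
To make the failure mode in the description concrete, here is a minimal sketch that contrasts the two checks. It is not the Elasticsearch code: `StateSketch`, `MasterChangeSketch`, `byNodeId`, and `byEphemeralId` are hypothetical names, and the cluster state is reduced to the three fields the description mentions.

```java
import java.util.function.Predicate;

// Hypothetical, simplified stand-in for a cluster state; NOT the Elasticsearch ClusterState class.
final class StateSketch {
    final String masterNodeId;      // persistent node ID, survives restarts; null if there is no master
    final String masterEphemeralId; // regenerated on every node start; null if there is no master
    final long version;             // cluster state version

    StateSketch(String masterNodeId, String masterEphemeralId, long version) {
        this.masterNodeId = masterNodeId;
        this.masterEphemeralId = masterEphemeralId;
        this.version = version;
    }
}

final class MasterChangeSketch {
    // Check as described for the old behaviour: compare node IDs, fall back to a version increase.
    static Predicate<StateSketch> byNodeId(StateSketch current) {
        final String currentMaster = current.masterNodeId;
        final long currentVersion = current.version;
        return newState -> {
            final String newMaster = newState.masterNodeId;
            if (newMaster == null) {
                return false; // no master elected yet
            } else if (!newMaster.equals(currentMaster)) {
                return true;  // a different node became master
            }
            return newState.version > currentVersion; // same node: re-election or rejoin
        };
    }

    // Check as described for the fix: compare ephemeral IDs, fall back to a version increase.
    static Predicate<StateSketch> byEphemeralId(StateSketch current) {
        final String currentMasterId = current.masterEphemeralId;
        final long currentVersion = current.version;
        return newState -> {
            final String newMasterId = newState.masterEphemeralId;
            if (newMasterId == null) {
                return false; // no master elected yet
            } else if (!newMasterId.equals(currentMasterId)) {
                return true;  // different node, or the same node after a restart
            }
            return newState.version > currentVersion;
        };
    }

    public static void main(String[] args) {
        StateSketch before = new StateSketch("node-1", "eph-a", 42);
        // Full restart: same node ID, fresh ephemeral ID, version reset to 0.
        StateSketch afterFullRestart = new StateSketch("node-1", "eph-b", 0);
        System.out.println(byNodeId(before).test(afterFullRestart));      // false: the change is missed
        System.out.println(byEphemeralId(before).test(afterFullRestart)); // true: the change is detected
    }
}
```

In this sketch the node-ID check misses the full-restart case because the node ID is unchanged and the version went down, while the ephemeral-ID check accepts the new state.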

@bleskes bleskes added :Distributed Coordination/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure >bug v5.6.0 v6.0.0 labels Jul 25, 2017
@bleskes bleskes requested a review from ywelsch July 25, 2017 10:09
Contributor

@ywelsch ywelsch left a comment

NullPointerException

public static Predicate<ClusterState> build(ClusterState currentState) {
final long currentVersion = currentState.version();
- final String currentMaster = currentState.nodes().getMasterNodeId();
+ final String currentMasterId = currentState.nodes().getMasterNode().getEphemeralId();
Contributor

NullPointerException if getMasterNode() returns null?

final String newMasterId = newState.nodes().getMasterNode().getEphemeralId();
final boolean accept;
- if (newMaster == null) {
+ if (newMasterId == null) {
Contributor

newMasterId (= getEphemeralId()) cannot be null, getMasterNode() can be, however.

Contributor Author

bad refactoring. yikes. will fix and check for the testing coverage.
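
As an illustration of the two review comments above, the sketch below shows why the chained call can throw and one way to guard it. The types `MasterNodeSketch`, `NodesSketch`, and `NullSafeSketch` are hypothetical stand-ins, not the Elasticsearch classes; the only assumption carried over from the thread is that `getMasterNode()` may return null while `getEphemeralId()` does not.

```java
// Hypothetical, simplified stand-ins; NOT the Elasticsearch DiscoveryNode/DiscoveryNodes classes.
final class MasterNodeSketch {
    private final String ephemeralId;

    MasterNodeSketch(String ephemeralId) {
        this.ephemeralId = ephemeralId;
    }

    String getEphemeralId() {
        return ephemeralId;
    }
}

final class NodesSketch {
    private final MasterNodeSketch master; // null while no master is elected

    NodesSketch(MasterNodeSketch master) {
        this.master = master;
    }

    MasterNodeSketch getMasterNode() {
        return master;
    }
}

final class NullSafeSketch {
    // Chained call: throws NullPointerException when there is no master.
    static String unsafeMasterId(NodesSketch nodes) {
        return nodes.getMasterNode().getEphemeralId();
    }

    // Guarded variant: capture the node first, then read the ephemeral ID only if a master exists.
    static String safeMasterId(NodesSketch nodes) {
        final MasterNodeSketch master = nodes.getMasterNode();
        return master == null ? null : master.getEphemeralId();
    }
}
```

With the guarded variant, the null check belongs on the node object rather than on the ephemeral ID, which is the point made in the review.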

if (newMasterId == null) {
accept = false;
- } else if (newMaster.equals(currentMaster) == false){
+ } else if (newMasterId.equals(currentMasterId) == false){
Contributor

maybe it's nicer to just capture the DiscoveryNode object here and use equals on that (internally it will use the ephemeral id for comparison anyhow).

Contributor Author

That's how I had it before. I refactored it because I prefer to have explicit references in places that depend on the ephemeral ID. I know we use the disco nodes themselves in many places, but I don't want to add another one.
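
The two options discussed in this thread can be sketched side by side. `NodeSketch` and `ComparisonSketch` are hypothetical; the `equals` implementation encodes only what the review comment states (that node equality is decided by the ephemeral ID), and is an assumption rather than a copy of the real class.

```java
// Hypothetical stand-in for a discovery node; NOT the Elasticsearch DiscoveryNode class.
final class NodeSketch {
    final String nodeId;      // persistent ID
    final String ephemeralId; // regenerated on restart

    NodeSketch(String nodeId, String ephemeralId) {
        this.nodeId = nodeId;
        this.ephemeralId = ephemeralId;
    }

    String getEphemeralId() {
        return ephemeralId;
    }

    // Assumption taken from the review comment: equality is decided by the ephemeral ID.
    @Override
    public boolean equals(Object o) {
        return o instanceof NodeSketch && ephemeralId.equals(((NodeSketch) o).ephemeralId);
    }

    @Override
    public int hashCode() {
        return ephemeralId.hashCode();
    }
}

final class ComparisonSketch {
    // Reviewer's suggestion: capture the node object and rely on equals.
    static boolean changedByEquals(NodeSketch current, NodeSketch next) {
        return !next.equals(current);
    }

    // Author's preference: make the dependency on the ephemeral ID explicit.
    static boolean changedByEphemeralId(NodeSketch current, NodeSketch next) {
        final String currentId = current == null ? null : current.getEphemeralId();
        return !next.getEphemeralId().equals(currentId);
    }
}
```

Under that assumption both variants give the same answer; the difference is only whether the reliance on the ephemeral ID is visible at the call site.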

Contributor Author

bleskes commented Jul 26, 2017

Thx @ywelsch. I addressed your feedback. Can you take another look, please?

Contributor

@ywelsch ywelsch left a comment

LGTM (after addressing 3 nits)

if (newMaster == null) {
accept = false;
- } else if (newMaster.equals(currentMaster) == false){
+ } else if (newMaster.getEphemeralId().equals(currentMasterId) == false){
Contributor

space missing between closing and opening bracket: false){


public void testDelegateToFailingMaster() throws ExecutionException, InterruptedException {
- boolean failsWithConnectTransportException = randomBoolean();
+ boolean failsWithConnectTransportException = true || randomBoolean();
Contributor

You probably want to remove true || before pushing ;-)

// reset the same state to increment a version simulating a join of an existing node
setState(clusterService, clusterService.state());
if (randomBoolean()) {
// simulate master node removal removal
Contributor

removal removal removal

@bleskes bleskes merged commit 03eb146 into elastic:master Jul 26, 2017
@bleskes bleskes deleted the master_change_master_restarts branch July 26, 2017 15:03
bleskes added a commit that referenced this pull request Jul 26, 2017
…er change (#25877)

bleskes added a commit that referenced this pull request Jul 26, 2017
…er change (#25877)

bleskes added a commit that referenced this pull request Jul 26, 2017
…er change (#25877)

bleskes added a commit that referenced this pull request Jul 26, 2017
…er change (#25877)


Labels

>bug :Distributed Coordination/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure v5.6.0 v6.0.0-beta1

Development

Successfully merging this pull request may close these issues.

[CI] SpecificMasterNodesIT#testSimpleOnlyMasterNodeElection failure

3 participants