Skip to content

Conversation

@ywelsch
Copy link
Contributor

@ywelsch ywelsch commented Jan 17, 2019

Some tests (DedicatedClusterSnapshotRestoreIT.testRestoreIndexWithShardsMissingInLocalGateway) were split-braining since being switched to Zen2 because the bootstrap setting was left around when nodes got restarted with data folders wiped.

The test in question here was starting one node (which autobootstrapped to that single node), then another node. The first node was then shut down (after excluding it from the voting configuration), its data folder wiped, and restarted. After restart, the node had an empty data folder yet initial_master_nodes set to itself (i.e. same name). This made the node sometimes form a cluster of its own, and not rejoin the existing cluster with the other node.

See e.g. https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/278/console

@ywelsch ywelsch added >non-issue >test Issues or PRs that are addressing/adding tests v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Jan 17, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@ywelsch
Copy link
Contributor Author

ywelsch commented Jan 17, 2019

@elasticmachine retest this please

@ywelsch
Copy link
Contributor Author

ywelsch commented Jan 18, 2019

@elasticmachine run the gradle build tests 1

assert DISCOVERY_ZEN_MINIMUM_MASTER_NODES_SETTING.exists(newSettings.build()) == false : "min master nodes is auto managed";
newSettings.put(DISCOVERY_ZEN_MINIMUM_MASTER_NODES_SETTING.getKey(), minMasterNodes).build();
newSettings.put(DISCOVERY_ZEN_MINIMUM_MASTER_NODES_SETTING.getKey(), minMasterNodes);
if (callbackSettings != null && INITIAL_MASTER_NODES_SETTING.exists(callbackSettings) == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think returning null from onNodeStopped is a bug since it looks equivalent to returning Settings.EMPTY. The only place I can see this happening is QuorumGatewayIT in which the onNodeStopped override can be completely removed. Can we just do that and then assert it's not null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jup

@ywelsch
Copy link
Contributor Author

ywelsch commented Jan 18, 2019

@elasticmachine run the gradle build tests 1

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ywelsch
Copy link
Contributor Author

ywelsch commented Jan 18, 2019

@elasticmachine run the gradle build tests 1
@elasticmachine run the gradle build tests 2

@ywelsch ywelsch merged commit 377d96e into elastic:master Jan 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >non-issue >test Issues or PRs that are addressing/adding tests v7.0.0-beta1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants