Skip to content

Conversation

@ywelsch
Copy link
Contributor

@ywelsch ywelsch commented Mar 14, 2019

Switches "discovery.type: single-node" from using a separate implementation for single-node discovery to using the existing standard discovery implementation, with two small adaptions:

  • auto-bootstrapping, but requiring initial_master_nodes not to be set.
  • not actively pinging other nodes using the Peerfinder
  • not allowing other nodes to join its single-node cluster (if they have e.g. been set up using regular discovery and connect to the single-disco node).

@ywelsch ywelsch added >enhancement v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.2.0 labels Mar 14, 2019
@ywelsch ywelsch requested a review from DaveCTurner March 14, 2019 08:54
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can improve our handling of the case where this node used to be part of a larger cluster and has since been restarted in single-node mode, in which it will no longer be able to form a cluster. For instance, the Coordinator could check this in doStart() and throw an exception if it'll never obtain a quorum.

I think we should also refuse to start if node.master: false.

I think both of the above need a mention in the breaking changes docs.

I would also like to reject PeersRequests from other nodes; today we respond normally and this means that other nodes' ClusterFormationFailureHelpers will report that they have "discovered" this node (in amongst all the exceptions about being unable to join it). (edit - see below)

logger.trace("handleJoinRequest: as {}, handling {}", mode, joinRequest);

if (singleNodeDiscovery && joinRequest.getSourceNode().equals(getLocalNode()) == false) {
joinCallback.onFailure(new IllegalStateException("cannot join node with single-node discovery"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest cannot join node with [discovery.type] set to [single-node] so that it's clearer what setting to look for.

@DaveCTurner
Copy link
Contributor

I would also like to reject PeersRequests from other nodes; today we respond normally and this means that other nodes' ClusterFormationFailureHelpers will report that they have "discovered" this node (in amongst all the exceptions about being unable to join it).

Rejecting PeersRequests isn't sufficient for this: if another node has our transport address then it'll connect to us, handshake, and add us to its known peers list without any further checks. I think we'd need to do something like only exposing known peers once they've responded to a PeersRequest.

@ywelsch
Copy link
Contributor Author

ywelsch commented Mar 20, 2019

I think we can improve our handling of the case where this node used to be part of a larger cluster and has since been restarted in single-node mode, in which it will no longer be able to form a cluster. For instance, the Coordinator could check this in doStart() and throw an exception if it'll never obtain a quorum.

It's an extra check in Coordinator for a super-edge case. I'm not sure how much that buys us (additional code to maintain vs slightly better error reporting), so just want to ask again if you think this is really worth pursuing.

I think we should also refuse to start if node.master: false.

Again, extra check for an edge case, and covered by ClusterFormationFailureHelper.

I think both of the above need a mention in the breaking changes docs.

I disagree and think this should be treated as a bug fix.

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Mar 20, 2019

It's an extra check in Coordinator for a super-edge case. I'm not sure how much that buys us (additional code to maintain vs slightly better error reporting), so just want to ask again if you think this is really worth pursuing

Yes, I think the improvement in the error reporting is worth it. Bootstrap checks continue to cause confusion, and if someone tries to disable the bootstrap checks by moving to single-node discovery after the cluster has formed in development mode then I think that keeping the node running despite a hopeless situation is going to add to the confusion.

Again, extra check for an edge case, and covered by ClusterFormationFailureHelper.

Similarly, keeping the node running in a hopeless situation is confusing. The log message would say master not discovered yet: discovery will continue using [...] from hosts providers... which is not true.

@ywelsch ywelsch requested a review from DaveCTurner March 26, 2019 16:10
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ywelsch ywelsch merged commit 730dad6 into elastic:master Mar 27, 2019
ywelsch added a commit that referenced this pull request Mar 27, 2019
Switches "discovery.type: single-node" from using a separate implementation for single-node discovery to using the existing standard discovery implementation, with two small adaptions:

-  auto-bootstrapping, but requiring initial_master_nodes not to be set.
- not actively pinging other nodes using the Peerfinder
- not allowing other nodes to join its single-node cluster (if they have e.g. been set up using regular discovery and connect to the single-disco node).
ywelsch added a commit that referenced this pull request Mar 27, 2019
Switches "discovery.type: single-node" from using a separate implementation for single-node discovery to using the existing standard discovery implementation, with two small adaptions:

-  auto-bootstrapping, but requiring initial_master_nodes not to be set.
- not actively pinging other nodes using the Peerfinder
- not allowing other nodes to join its single-node cluster (if they have e.g. been set up using regular discovery and connect to the single-disco node).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.0.0-rc2 v7.2.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants