Allow cluster access during node restart #42946
This commit modifies InternalTestCluster to allow using client() and other operations inside a RestartCallback (typically onStoppedNode). The approach is to have most methods return the state as if the restarting node did not exist, which avoids various exceptions caused by accessing the stopped node(s).
Pinging @elastic/es-distributed
When explicitly asking for an instance from a node, return it also from closed nodes.
ywelsch
left a comment
Added a suggestion on how we could possibly do this in a simpler way
private synchronized NodeAndClient getRandomNodeAndClient(Predicate<NodeAndClient> predicate) {
private NodeAndClient getRandomNodeAndClient(Predicate<NodeAndClient> predicate) {
    return getRandomNodeAndClientIncludingClosed(((Predicate<NodeAndClient>) nc -> nc.isClosed() == false).and(predicate));
can we avoid the cast by writing this as predicate.and(nc -> nc.isClosed() == false)?
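To illustrate the suggestion: a bare lambda has no type of its own, so calling `.and(...)` on it requires a cast, whereas starting from the already-typed `predicate` lets the lambda take its type from the parameter of `and`. The sketch below uses a hypothetical, stripped-down `NodeAndClient` (not the real InternalTestCluster class) and hypothetical method names `withCast` / `withoutCast`.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for the real NodeAndClient, reduced to what the example needs.
final class NodeAndClient {
    private final String name;
    private final boolean closed;

    NodeAndClient(String name, boolean closed) {
        this.name = name;
        this.closed = closed;
    }

    String name() { return name; }

    boolean isClosed() { return closed; }
}

final class PredicateComposition {
    // The lambda comes first, so it must be cast to give it a type before and(...) can be called.
    static Predicate<NodeAndClient> withCast(Predicate<NodeAndClient> predicate) {
        return ((Predicate<NodeAndClient>) nc -> nc.isClosed() == false).and(predicate);
    }

    // The typed predicate comes first; the lambda is typed by the parameter of and(...).
    static Predicate<NodeAndClient> withoutCast(Predicate<NodeAndClient> predicate) {
        return predicate.and(nc -> nc.isClosed() == false);
    }
}
```

Both forms accept the same nodes. The only difference is evaluation order: `Predicate.and` short-circuits, so the second form runs the caller's predicate before the closed check.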
    }
}

public boolean isClosed() {
AFAICS you're using isClosed() == false everywhere. Perhaps turn this into an isOpen method, and then just use NodeAndClient::isOpen as a predicate.
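A minimal sketch of what that could look like, assuming a simplified `NodeAndClient` with a mutable `closed` flag. The `OpenNodes.openNodeNames` helper is hypothetical and exists only to show the method reference in use.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for the real NodeAndClient.
final class NodeAndClient {
    private final String name;
    private boolean closed;

    NodeAndClient(String name) {
        this.name = name;
    }

    String name() { return name; }

    void close() { closed = true; }

    // Positive form of the check, so callers do not repeat "isClosed() == false".
    boolean isOpen() { return closed == false; }
}

final class OpenNodes {
    // Returns the names of open nodes that also match the caller's predicate.
    static List<String> openNodeNames(List<NodeAndClient> nodes, Predicate<NodeAndClient> extra) {
        return nodes.stream()
            .filter(NodeAndClient::isOpen) // method reference used directly as a predicate
            .filter(extra)
            .map(NodeAndClient::name)
            .collect(Collectors.toList());
    }
}
```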
    expectedNodes.add(getInstanceFromNode(ClusterService.class, nodeAndClient.node()).localNode());
}
Set<DiscoveryNode> expectedNodes =
    nodes.values().stream()
should we temporarily remove the node from the nodes map when it is restarted during the close? This could possibly simplify things here, not requiring us to exclude closed nodes everywhere. WDYT?
I originally did not do so due to test failures. In hindsight, I agree; it will leave this in a much cleaner state. I will give it a go and see if I can fix the tests that need fixing (at least GatewayIndexStateIT.testArchiveBrokenClusterSettings, since it calls getInstance for a stopped node).
…ork_disregard_nodes_during_restart
Changed strategy to simply remove the nodes from the nodes map during restart. This simplifies the change and cements that you cannot get hold of a restarting node in onStopped. Fixed the 3 violating cases to comply with that.
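The revised strategy can be sketched roughly as follows. This is a toy model, not the InternalTestCluster code: `MiniTestCluster`, its string-keyed map, and the single-method `RestartCallback` are all hypothetical simplifications. The point is only that a node removed from the map before the callback runs is invisible to every map-based lookup for the duration of the callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of a test cluster that tracks nodes by name.
final class MiniTestCluster {
    // Hypothetical, simplified callback; the real RestartCallback has a richer contract.
    interface RestartCallback {
        void onStoppedNode(String nodeName);
    }

    private final Map<String, String> nodes = new TreeMap<>(); // node name -> state

    void startNode(String name) {
        nodes.put(name, "started");
    }

    // Everything that enumerates nodes goes through the map, so removed nodes are never seen.
    List<String> visibleNodes() {
        return new ArrayList<>(nodes.keySet());
    }

    void restartNode(String name, RestartCallback callback) {
        // Remove first: during the callback the cluster behaves as if the node did not exist.
        if (nodes.remove(name) == null) {
            throw new IllegalArgumentException("unknown node: " + name);
        }
        callback.onStoppedNode(name);
        // Put the node back once the restart has completed.
        nodes.put(name, "started");
    }
}
```

With this shape, no lookup needs its own "exclude closed nodes" filter, which is the simplification discussed in the review.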
@elasticmachine update branch
Thanks for reviewing @ywelsch, this is now much simpler and ready for another round.
ywelsch
left a comment
LGTM
    }
}

if (activeDisruptionScheme != null) {
why remove this? Is it a NOOP?
We now call publishNode, which both adds to the nodes map and applies active disruption scheme.
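As a rough illustration of why the separate null check became redundant, here is a sketch of a single publish step that does both jobs. `PublishingCluster`, the one-method `DisruptionScheme` interface, and `setDisruptionScheme` are hypothetical; only the name `publishNode` and its two responsibilities come from the comment above.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model; the real publishNode operates on NodeAndClient instances.
final class PublishingCluster {
    // Hypothetical, simplified disruption hook.
    interface DisruptionScheme {
        void applyToNode(String nodeName);
    }

    private final Set<String> nodes = new TreeSet<>();
    private DisruptionScheme activeDisruptionScheme; // null when no disruption is active

    void setDisruptionScheme(DisruptionScheme scheme) {
        this.activeDisruptionScheme = scheme;
    }

    Set<String> nodeNames() {
        return new TreeSet<>(nodes);
    }

    // Single entry point: registers the node and applies any active disruption scheme,
    // so callers no longer need their own "activeDisruptionScheme != null" check.
    void publishNode(String name) {
        nodes.add(name);
        if (activeDisruptionScheme != null) {
            activeDisruptionScheme.applyToNode(name);
        }
    }
}
```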
This commit modifies InternalTestCluster to allow using client() and other operations inside a RestartCallback (onStoppedNode typically). Restarting nodes are now removed from the map and thus all methods now return the state as if the restarting node does not exist. This avoids various exceptions stemming from accessing the stopped node(s).
Part of #42518