Add an allocation decider to prevent allocating shards to nodes which are preparing for shutdown #71658
Conversation
@dakrone This might need some tweaks if I've forgotten a behavior we intended for shard handling, or if I've misunderstood how allocation deciders work in some way. If there's a big issue or you think it would be better to have a sync conversation, let me know. Might also add some unit tests if you think they'd be valuable.

Pinging @elastic/es-core-infra (Team:Core/Infra)

Pinging @elastic/es-distributed (Team:Distributed)
dakrone left a comment:
Thanks for working on this, Gordon. I left a couple of minor comments.

> Might also add some unit tests if you think they'd be valuable.

I think this would be valuable; it's always nice to be able to run `./gradlew test` and see changes much sooner than having to run integration tests, so I'm in favor of adding them (and I think writing unit tests forces us to write more unit-testable code).
```java
addAllocationDecider(deciders, new ThrottlingAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new ShardsLimitAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new AwarenessAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new NodeShutdownAllocationDecider());
```
These deciders are actually loosely in order of performance, because regular decision making short-circuits to avoid doing the processing if it's not needed.
So I think this should be moved up to right after the RestoreInProgressAllocationDecider, because it's fairly simple and there's no need to calculate filtering rules if the node can't have any data because it's being shut down anyway.
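A minimal sketch of the reordering being suggested, excerpted from the decider-registration list in `ClusterModule` (the neighboring deciders shown here reflect the 7.x list as I recall it, and are an assumption rather than part of this PR's diff):

```java
// Deciders are registered roughly cheapest-first so the chain can short-circuit.
addAllocationDecider(deciders, new SnapshotInProgressAllocationDecider());
addAllocationDecider(deciders, new RestoreInProgressAllocationDecider());
// Moved up: a shutting-down node can't take data anyway, so make this cheap
// metadata check before the costlier filtering and disk-threshold deciders run.
addAllocationDecider(deciders, new NodeShutdownAllocationDecider());
addAllocationDecider(deciders, new FilterAllocationDecider(settings, clusterSettings));
```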
```java
        return allocation.decision(Decision.NO, NAME, "node [%s] is preparing to be removed from the cluster", node.nodeId());
    }

    return Decision.YES;
```
Can you change this to be a descriptive YES decision saying that the node is being shut down, but with the RESTART type (or at least, not the REMOVE type)?
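A sketch of what such a descriptive YES decision might look like inside `canAllocate` (the message wording here is illustrative; the PR's eventual text appears in a later diff):

```java
// Sketch: a YES decision that records *why* allocation is allowed, so
// cluster-allocation-explain output shows the node is restarting, not being removed.
if (SingleNodeShutdownMetadata.Type.RESTART.equals(thisNodeShutdownMetadata.getType())) {
    return allocation.decision(
        Decision.YES,
        NAME,
        "node [%s] is preparing to restart, but will remain in the cluster",
        node.nodeId()
    );
}
```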
```java
if (thisNodeShutdownMetadata == null) {
    return allocation.decision(Decision.YES, NAME, "node [%s] is not preparing for removal from the cluster", node.nodeId());
} else if (SingleNodeShutdownMetadata.Type.RESTART.equals(thisNodeShutdownMetadata.getType())) {
```
This seems like it'd be clearer as a switch statement on the type, with a default that always throws an exception, rather than a series of ifs, but this is mostly personal preference.
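A sketch of the switch-based shape being suggested (the decision messages and exception text are illustrative, not the PR's final code):

```java
// Sketch: one switch over the shutdown type; an always-throwing default means a
// shutdown type added in the future fails loudly instead of being silently mishandled.
switch (thisNodeShutdownMetadata.getType()) {
    case RESTART:
        return allocation.decision(Decision.YES, NAME,
            "node [%s] is preparing to restart, but will remain in the cluster", node.nodeId());
    case REMOVE:
        return allocation.decision(Decision.NO, NAME,
            "node [%s] is preparing for removal from the cluster", node.nodeId());
    default:
        throw new IllegalStateException(
            "unrecognized node shutdown type [" + thisNodeShutdownMetadata.getType() + "]");
}
```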
dakrone left a comment:
This LGTM in general, but I did leave some questions, curious what your thoughts are on them!
```java
            node.nodeId()
        );
    default:
        logger.error(
```
If we ever get into this state, I think this is going to spam the logs like crazy and be unsilenceable (because it's at the error level). I mean, hopefully we don't get into this state, but I think I'd prefer to just have the assert fail our builds. What do you think?
You're right, I didn't think about how spammy it would be. I do think that a way to identify that we're hitting this condition in the field besides looking for a YES decision with no explanation is good (what can I say, I'm paranoid), but I'll drop it down to DEBUG so at least it won't spam by default. Sound good?
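A sketch of the compromise described above, as the default arm of the switch (names, messages, and the YES fallback are illustrative):

```java
    default:
        // Sketch: the assert fails test/CI builds (where assertions are enabled),
        // while the DEBUG log leaves a breadcrumb in production without spamming
        // the logs by default.
        logger.debug("unrecognized node shutdown type [{}] for node [{}], allowing allocation",
            thisNodeShutdownMetadata.getType(), node.nodeId());
        assert false : "unrecognized node shutdown type " + thisNodeShutdownMetadata.getType();
        return allocation.decision(Decision.YES, NAME,
            "node [%s] has unrecognized shutdown metadata", node.nodeId());
```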
```java
    return allocation.decision(Decision.YES, NAME, "node [%s] is preparing to restart, but will remain in the cluster",
        node.getId());
```
If we know a node is going to be restarted, should we actually expand replicas to the node? I'm trying to think of a scenario where we would need to expand replicas for safety if we knew the node was going to be restarted, but I haven't thought of any; right now it seems to make more sense to keep the data off if we know the node will be going away. What do you think?
That does make sense - my thinking was that if a node is restarting, we can allow the replica expansion and it won't matter as we won't reassign the shard (once we implement that bit, anyway). I think we could go either way on this, but until/unless we find a case where it's beneficial to expand while the node is still shutting down, I'll change it to a NO.
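A sketch of the change agreed on here: `shouldAutoExpandToNode` returns NO for any node with shutdown metadata, restart and removal alike. The override signature matches the `AllocationDecider` API of this era; the metadata accessors (`NodesShutdownMetadata.TYPE`, `getAllNodeMetadataMap`) are assumptions about how the decider looks up the shutdown record:

```java
@Override
public Decision shouldAutoExpandToNode(IndexMetadata indexMetadata, DiscoveryNode node, RoutingAllocation allocation) {
    // Assumed lookup: find this node's shutdown record in the cluster metadata, if any.
    NodesShutdownMetadata shutdowns = allocation.metadata().custom(NodesShutdownMetadata.TYPE);
    SingleNodeShutdownMetadata thisNodeShutdownMetadata = shutdowns == null
        ? null
        : shutdowns.getAllNodeMetadataMap().get(node.getId());
    if (thisNodeShutdownMetadata == null) {
        return allocation.decision(Decision.YES, NAME,
            "node [%s] is not preparing for shutdown", node.getId());
    }
    // Restarting or being removed: either way, don't expand replicas onto it now.
    return allocation.decision(Decision.NO, NAME,
        "node [%s] is preparing for shutdown [%s]", node.getId(), thisNodeShutdownMetadata.getType());
}
```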
```java
    case REMOVE:
        return allocation.decision(Decision.NO, NAME, "node [%s] is preparing for removal from the cluster", node.getId());
    default:
        logger.error(
```
Same comment here as above about the log spam, maybe we should just go with only the assert here?
Same response here, I'll drop it to DEBUG.
… are preparing for shutdown (elastic#71658)
… are preparing for shutdown (#71658) (#72587)
This PR adds an allocation decider which uses the metadata managed by the Node Shutdown API to prevent shards from being allocated to nodes which are preparing to be removed from the cluster.
Additionally, shards will not be auto-expanded to nodes which are preparing to restart, instead waiting until after the restart is complete to expand the shard replication.
Relates #70338.