
Conversation

@gwbrown (Contributor) commented Apr 13, 2021

This PR adds an allocation decider which uses the metadata managed by the Node Shutdown API to prevent shards from being allocated to nodes which are preparing to be removed from the cluster.

Additionally, shards will not be auto-expanded to nodes which are preparing to restart, instead waiting until after the restart is complete to expand the shard replication.

Relates #70338.
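For reference, here is a minimal sketch of the decider's shape. It assumes the standard AllocationDecider hooks; the shutdown-metadata accessors used below (NodesShutdownMetadata.TYPE, getAllNodeMetadataMap) are a best guess at the Node Shutdown API surface and may not match the merged code exactly.

import org.elasticsearch.cluster.metadata.NodesShutdownMetadata;
import org.elasticsearch.cluster.metadata.SingleNodeShutdownMetadata;
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.AllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

// Sketch only, not the merged implementation.
public class NodeShutdownAllocationDecider extends AllocationDecider {
    private static final String NAME = "node_shutdown";

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        // Look up the shutdown record (if any) registered for the target node.
        NodesShutdownMetadata shutdowns = allocation.metadata().custom(NodesShutdownMetadata.TYPE);
        SingleNodeShutdownMetadata thisNode = shutdowns == null
            ? null
            : shutdowns.getAllNodeMetadataMap().get(node.nodeId()); // accessor name is an assumption

        if (thisNode == null) {
            return allocation.decision(Decision.YES, NAME,
                "node [%s] is not preparing to shut down", node.nodeId());
        }
        if (thisNode.getType() == SingleNodeShutdownMetadata.Type.REMOVE) {
            return allocation.decision(Decision.NO, NAME,
                "node [%s] is preparing to be removed from the cluster", node.nodeId());
        }
        // RESTART (or any non-REMOVE type): the node stays in the cluster, so allocation is fine.
        return allocation.decision(Decision.YES, NAME,
            "node [%s] is preparing to restart, but will remain in the cluster", node.nodeId());
    }
}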

@gwbrown added the >non-issue, :Distributed Coordination/Allocation, v8.0.0, v7.13.0, and :Core/Infra/Node Lifecycle labels Apr 13, 2021
@gwbrown requested a review from dakrone April 22, 2021 23:07
@gwbrown (Contributor, Author) commented Apr 22, 2021

@dakrone This might need some tweaks if I've forgotten a behavior we intended for shard handling, or if I've misunderstood how allocation deciders work in some way. If there's a big issue or you think it would be better to have a sync conversation let me know.

Might also add some unit tests if you think they'd be valuable.

@gwbrown marked this pull request as ready for review April 23, 2021 22:56
@elasticmachine added the Team:Core/Infra and Team:Distributed (Obsolete) labels Apr 23, 2021
@elasticmachine (Collaborator) commented

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed (Team:Distributed)

@dakrone (Member) left a comment

Thanks for working on this Gordon, I left a couple of minor comments

> Might also add some unit tests if you think they'd be valuable.

I think this would be valuable; it's always nice to be able to run ./gradlew test and see the results of changes much sooner than having to run integration tests, so I'm in favor of adding them (and I think writing unit tests forces us to write more unit-testable code).

addAllocationDecider(deciders, new ThrottlingAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new ShardsLimitAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new AwarenessAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new NodeShutdownAllocationDecider());
@dakrone (Member) commented on this hunk:

These deciders are actually loosely in order of performance, because regular decision making short-circuits to avoid doing the processing if it's not needed.

So I think this should be moved up to right after the RestoreInProgressAllocationDecider, because it's fairly simple and there's no need to calculate filtering rules if the node can't have any data because it's being shut down anyway.
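Concretely, the ClusterModule wiring would read roughly like this (a sketch of the suggested ordering only; the constructor arguments shown are assumptions based on the surrounding deciders):

// Suggested ordering: the shutdown check is cheap and can short-circuit early,
// before filter/awareness rules are calculated for a node that can't take data anyway.
addAllocationDecider(deciders, new RestoreInProgressAllocationDecider());
addAllocationDecider(deciders, new NodeShutdownAllocationDecider());
addAllocationDecider(deciders, new FilterAllocationDecider(settings, clusterSettings));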

return allocation.decision(Decision.NO, NAME, "node [%s] is preparing to be removed from the cluster", node.nodeId());
}

return Decision.YES;
@dakrone (Member) commented on this hunk:

Can you change this to be a descriptive YES decision that the node is being shut down, but it's of RESTART type (or at least, not REMOVE type)?
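For example, something along these lines (the wording is just a suggestion):

return allocation.decision(Decision.YES, NAME,
    "node [%s] is preparing to restart, but will remain in the cluster", node.nodeId());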


if (thisNodeShutdownMetadata == null) {
return allocation.decision(Decision.YES, NAME, "node [%s] is not preparing for removal from the cluster");
} else if (SingleNodeShutdownMetadata.Type.RESTART.equals(thisNodeShutdownMetadata.getType())){
@dakrone (Member) commented on this hunk:

This seems like it'd be clearer to have it in a switch statement on the type, with a default that always throws an exception, rather than a series of ifs, but this is mostly personal preference.
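Roughly, a sketch of what that switch could look like, reusing the decision messages already in the change:

switch (thisNodeShutdownMetadata.getType()) {
    case RESTART:
        return allocation.decision(Decision.YES, NAME,
            "node [%s] is preparing to restart, but will remain in the cluster", node.nodeId());
    case REMOVE:
        return allocation.decision(Decision.NO, NAME,
            "node [%s] is preparing to be removed from the cluster", node.nodeId());
    default:
        // A default that always throws keeps the switch exhaustive if new types are added.
        throw new IllegalStateException(
            "unknown node shutdown type [" + thisNodeShutdownMetadata.getType() + "]");
}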

@gwbrown requested a review from dakrone April 27, 2021 15:04
@dakrone (Member) left a comment

This LGTM in general, but I did leave some questions, curious what your thoughts are on them!

node.nodeId()
);
default:
logger.error(
@dakrone (Member) commented on this hunk:

If we ever get into this state, I think this is going to spam the logs like crazy and be unsilenceable (because it's at the error level). I mean, hopefully we don't get into this state, but I think I'd prefer to just have the assert fail our builds. What do you think?

@gwbrown (Author) replied:

You're right, I didn't think about how spammy it would be. I do think it's good to have a way to identify that we're hitting this condition in the field besides looking for a YES decision with no explanation (what can I say, I'm paranoid), but I'll drop it down to DEBUG so at least it won't spam by default. Sound good?
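So the default branch would end up looking roughly like this sketch (the YES message here is illustrative, not the final wording):

default:
    // Unknown shutdown type: fail our builds via the assert, but only log at DEBUG in
    // production so the condition can still be found without flooding the logs.
    logger.debug("found unrecognized node shutdown type [{}] for node [{}]",
        thisNodeShutdownMetadata.getType(), node.nodeId());
    assert false : "unknown node shutdown type [" + thisNodeShutdownMetadata.getType() + "]";
    return allocation.decision(Decision.YES, NAME,
        "node [%s] has an unrecognized shutdown type, allowing allocation", node.nodeId());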

Comment on lines 95 to 96
return allocation.decision(Decision.YES, NAME, "node [%s] is preparing to restart, but will remain in the cluster",
node.getId());
@dakrone (Member) commented on this hunk:

If we know a node is going to be restarted, should we actually expand replicas to the node? I'm trying to think of a scenario where we would need to expand replicas for safety if we knew the node was going to be restarted, but I haven't thought of any; right now it seems to make more sense to keep the data off if we know the node will be going away. What do you think?

@gwbrown (Author) replied:

That does make sense - my thinking was that if a node is restarting, we can allow the replica expansion and it won't matter as we won't reassign the shard (once we implement that bit, anyway). I think we could go either way on this, but until/unless we find a case where it's beneficial to expand while the node is still shutting down, I'll change it to a NO.
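So the restart branch of the auto-expand check would flip to a NO along these lines (a sketch; the final message may differ):

case RESTART:
    return allocation.decision(Decision.NO, NAME,
        "node [%s] is preparing to restart, so replicas should not be auto-expanded onto it until the restart completes",
        node.getId());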

case REMOVE:
return allocation.decision(Decision.NO, NAME, "node [%s] is preparing for removal from the cluster", node.getId());
default:
logger.error(
@dakrone (Member) commented on this hunk:

Same comment here as above about the log spam, maybe we should just go with only the assert here?

@gwbrown (Author) replied:

Same response here, I'll drop it to DEBUG.

@gwbrown merged commit f0c227d into elastic:master Apr 30, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Apr 30, 2021
… are preparing for shutdown (elastic#71658)

This PR adds an allocation decider which uses the metadata managed by the Node Shutdown API to prevent shards from being allocated to nodes which are preparing to be removed from the cluster.

Additionally, shards will not be auto-expanded to nodes which are preparing to restart, instead waiting until after the restart is complete to expand the shard replication.
gwbrown added a commit that referenced this pull request Apr 30, 2021
… are preparing for shutdown (#71658) (#72587)

This PR adds an allocation decider which uses the metadata managed by the Node Shutdown API to prevent shards from being allocated to nodes which are preparing to be removed from the cluster.

Additionally, shards will not be auto-expanded to nodes which are preparing to restart, instead waiting until after the restart is complete to expand the shard replication.

Labels

:Core/Infra/Node Lifecycle, :Distributed Coordination/Allocation, >non-issue, Team:Core/Infra, Team:Distributed (Obsolete), v7.14.0, v8.0.0-alpha1
