
Conversation

@gwbrown (Contributor) commented Apr 13, 2021

This PR adds an allocation decider which uses the metadata managed by the Node Shutdown API to prevent shards from being allocated to nodes which are preparing to be removed from the cluster.

Additionally, shards will not be auto-expanded to nodes which are preparing to restart, instead waiting until after the restart is complete to expand the shard replication.

Relates #70338.
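For reference, here is a minimal sketch of the decider's shape. It assumes the standard AllocationDecider hooks; the shutdown-metadata accessors used below (NodesShutdownMetadata.TYPE, getAllNodeMetadataMap) are a best guess at the Node Shutdown API surface and may not match the merged code exactly.

import org.elasticsearch.cluster.metadata.NodesShutdownMetadata;
import org.elasticsearch.cluster.metadata.SingleNodeShutdownMetadata;
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.AllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

// Sketch only, not the merged implementation.
public class NodeShutdownAllocationDecider extends AllocationDecider {
    private static final String NAME = "node_shutdown";

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        // Look up the shutdown record (if any) registered for the target node.
        NodesShutdownMetadata shutdowns = allocation.metadata().custom(NodesShutdownMetadata.TYPE);
        SingleNodeShutdownMetadata thisNode = shutdowns == null
            ? null
            : shutdowns.getAllNodeMetadataMap().get(node.nodeId()); // accessor name is an assumption

        if (thisNode == null) {
            return allocation.decision(Decision.YES, NAME,
                "node [%s] is not preparing to shut down", node.nodeId());
        }
        if (thisNode.getType() == SingleNodeShutdownMetadata.Type.REMOVE) {
            return allocation.decision(Decision.NO, NAME,
                "node [%s] is preparing to be removed from the cluster", node.nodeId());
        }
        // RESTART (or any non-REMOVE type): the node stays in the cluster, so allocation is fine.
        return allocation.decision(Decision.YES, NAME,
            "node [%s] is preparing to restart, but will remain in the cluster", node.nodeId());
    }
}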

@gwbrown added the >non-issue, :Distributed Coordination/Allocation, v8.0.0, v7.13.0, and :Core/Infra/Node Lifecycle labels Apr 13, 2021
@gwbrown requested a review from dakrone April 22, 2021 23:07
@gwbrown (Contributor, Author) commented Apr 22, 2021

@dakrone This might need some tweaks if I've forgotten a behavior we intended for shard handling, or if I've misunderstood how allocation deciders work in some way. If there's a big issue or you think it would be better to have a sync conversation let me know.

Might also add some unit tests if you think they'd be valuable.

@gwbrown marked this pull request as ready for review April 23, 2021 22:56
@elasticmachine added the Team:Core/Infra and Team:Distributed (Obsolete) labels Apr 23, 2021
@elasticmachine (Collaborator) commented

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed (Team:Distributed)

@dakrone (Member) left a comment

Thanks for working on this Gordon, I left a couple of minor comments

> Might also add some unit tests if you think they'd be valuable.

I think this would be valuable; it's always nice to be able to run ./gradlew test and see the results of changes much sooner than having to run integration tests, so I'm in favor of adding them (and I think writing unit tests forces us to write more unit-testable code).

addAllocationDecider(deciders, new ThrottlingAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new ShardsLimitAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new AwarenessAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new NodeShutdownAllocationDecider());
@dakrone (Member) commented on this hunk:

These deciders are actually loosely in order of performance, because regular decision making short-circuits to avoid doing the processing if it's not needed.

So I think this should be moved up to right after the RestoreInProgressAllocationDecider, because it's fairly simple and there's no need to calculate filtering rules if the node can't have any data because it's being shut down anyway.
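Concretely, the ClusterModule wiring would read roughly like this (a sketch of the suggested ordering only; the constructor arguments shown are assumptions based on the surrounding deciders):

// Suggested ordering: the shutdown check is cheap and can short-circuit early,
// before filter/awareness rules are calculated for a node that can't take data anyway.
addAllocationDecider(deciders, new RestoreInProgressAllocationDecider());
addAllocationDecider(deciders, new NodeShutdownAllocationDecider());
addAllocationDecider(deciders, new FilterAllocationDecider(settings, clusterSettings));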

return allocation.decision(Decision.NO, NAME, "node [%s] is preparing to be removed from the cluster", node.nodeId());
}

return Decision.YES;
@dakrone (Member) commented on this hunk:

Can you change this to be a descriptive YES decision that the node is being shut down, but it's of RESTART type (or at least, not REMOVE type)?
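For example, something along these lines (the wording is just a suggestion):

return allocation.decision(Decision.YES, NAME,
    "node [%s] is preparing to restart, but will remain in the cluster", node.nodeId());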


if (thisNodeShutdownMetadata == null) {
return allocation.decision(Decision.YES, NAME, "node [%s] is not preparing for removal from the cluster");
} else if (SingleNodeShutdownMetadata.Type.RESTART.equals(thisNodeShutdownMetadata.getType())){
@dakrone (Member) commented on this hunk:

This seems like it'd be clearer to have it in a switch statement on the type, with a default that always throws an exception, rather than a series of ifs, but this is mostly personal preference.
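Roughly, a sketch of what that switch could look like, reusing the decision messages already in the change:

switch (thisNodeShutdownMetadata.getType()) {
    case RESTART:
        return allocation.decision(Decision.YES, NAME,
            "node [%s] is preparing to restart, but will remain in the cluster", node.nodeId());
    case REMOVE:
        return allocation.decision(Decision.NO, NAME,
            "node [%s] is preparing to be removed from the cluster", node.nodeId());
    default:
        // A default that always throws keeps the switch exhaustive if new types are added.
        throw new IllegalStateException(
            "unknown node shutdown type [" + thisNodeShutdownMetadata.getType() + "]");
}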

@gwbrown requested a review from dakrone April 27, 2021 15:04
@dakrone (Member) left a comment

This LGTM in general, but I did leave some questions, curious what your thoughts are on them!

node.nodeId()
);
default:
logger.error(
@dakrone (Member) commented on this hunk:

If we ever get into this state, I think this is going to spam the logs like crazy and be unsilenceable (because it's at the error level). I mean, hopefully we don't get into this state, but I think I'd prefer to just have the assert fail our builds. What do you think?

@gwbrown (Author) replied:

You're right, I didn't think about how spammy it would be. I do think it's good to have a way to identify that we're hitting this condition in the field besides looking for a YES decision with no explanation (what can I say, I'm paranoid), but I'll drop it down to DEBUG so at least it won't spam by default. Sound good?
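So the default branch would end up looking roughly like this sketch (the YES message here is illustrative, not the final wording):

default:
    // Unknown shutdown type: fail our builds via the assert, but only log at DEBUG in
    // production so the condition can still be found without flooding the logs.
    logger.debug("found unrecognized node shutdown type [{}] for node [{}]",
        thisNodeShutdownMetadata.getType(), node.nodeId());
    assert false : "unknown node shutdown type [" + thisNodeShutdownMetadata.getType() + "]";
    return allocation.decision(Decision.YES, NAME,
        "node [%s] has an unrecognized shutdown type, allowing allocation", node.nodeId());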

Comment on lines 95 to 96
return allocation.decision(Decision.YES, NAME, "node [%s] is preparing to restart, but will remain in the cluster",
node.getId());
@dakrone (Member) commented on this hunk:

If we know a node is going to be restarted, should we actually expand replicas to the node? I'm trying to think of a scenario where we would need to expand replicas for safety if we knew the node was going to be restarted, but I haven't thought of any; right now it seems to make more sense to keep the data off if we know the node will be going away. What do you think?

@gwbrown (Author) replied:

That does make sense - my thinking was that if a node is restarting, we can allow the replica expansion and it won't matter as we won't reassign the shard (once we implement that bit, anyway). I think we could go either way on this, but until/unless we find a case where it's beneficial to expand while the node is still shutting down, I'll change it to a NO.
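So the restart branch of the auto-expand check would flip to a NO along these lines (a sketch; the final message may differ):

case RESTART:
    return allocation.decision(Decision.NO, NAME,
        "node [%s] is preparing to restart, so replicas should not be auto-expanded onto it until the restart completes",
        node.getId());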

case REMOVE:
return allocation.decision(Decision.NO, NAME, "node [%s] is preparing for removal from the cluster", node.getId());
default:
logger.error(
@dakrone (Member) commented on this hunk:

Same comment here as above about the log spam, maybe we should just go with only the assert here?

@gwbrown (Author) replied:

Same response here, I'll drop it to DEBUG.

@gwbrown merged commit f0c227d into elastic:master Apr 30, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Apr 30, 2021
… are preparing for shutdown (elastic#71658)

This PR adds an allocation decider which uses the metadata managed by the Node Shutdown API to prevent shards from being allocated to nodes which are preparing to be removed from the cluster.

Additionally, shards will not be auto-expanded to nodes which are preparing to restart, instead waiting until after the restart is complete to expand the shard replication.
gwbrown added a commit that referenced this pull request Apr 30, 2021
… are preparing for shutdown (#71658) (#72587)

This PR adds an allocation decider which uses the metadata managed by the Node Shutdown API to prevent shards from being allocated to nodes which are preparing to be removed from the cluster.

Additionally, shards will not be auto-expanded to nodes which are preparing to restart, instead waiting until after the restart is complete to expand the shard replication.

Labels

:Core/Infra/Node Lifecycle, :Distributed Coordination/Allocation, >non-issue, Team:Core/Infra, Team:Distributed (Obsolete), v7.14.0, v8.0.0-alpha1
