A common need when administering nodes is to remove one or more nodes from the cluster without interrupting any operations. I think it would be useful to have a dedicated API for performing this decommission that is smarter than our current process.
Currently, to decommission a node, a user sets the "cluster.routing.allocation.exclude._id" setting to the ID of the node they want to remove from the cluster (alternatively, _ip, _host, or _name can be used). They then have to wait for the shards to be moved off of the node, checking the /_cat/allocation or similar APIs.
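For reference, the current manual workflow looks roughly like this (the node ID is illustrative); first exclude the node, then repeatedly check that its shard count has dropped to 0:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._id": "node-id-1"
  }
}

GET /_cat/allocation?v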
The issue with this is that during this window it's possible for mutually exclusive settings to be set that cause data not to be migrated off the node, or for other filtering settings to become unsatisfiable because the node will subsequently be removed from the cluster. Additionally, there isn't a simple way to check readiness other than reading the cat APIs (which should not be relied on programmatically) or the cluster state routing table.
What would be nice is the ability to mark a node as "pending decommission" such as:
PUT /_cluster/allocation/decommission
{
"nodes": ["node-id-1", "node-id-2"]
}This API would act similarly to the allocation exclusion API. We could then also treat these node ids "specially" internally, for instance we could disallow setting index-level shard filtering directly to these node IDs (this would be very nice for ILM since we randomly pick a node for allocation during a shrink). We could also potentially say "when the master node fails, prefer not to elect this node because it is going to be shortly removed from the cluster".
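To illustrate the first point, this is the kind of index-level filtering request that could be rejected (or at least flagged) while node-id-1 is pending decommission; the index name is made up and the rejection behavior is hypothetical:
PUT /my-index/_settings
{
  "index.routing.allocation.require._id": "node-id-1"
}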
Rather than requiring the user to check our other allocation APIs for the status of the decommission, we could provide a way to check the status of decommissioning nodes:
GET /_cluster/allocation/status
Which could give status like:
{
  "nodes": [
    {
      "node": "data-node-1",
      "node_id": "node-id-1",
      "safe_to_shutdown": false,
      "data_loss_on_removal": false,
      "decommission_possible": false,
      "shards_remaining": 7,
      "time_since_decommission": "1.2h",
      "time_since_decommission_millis": 4320000
    },
    {
      "node": "data-node-2",
      "node_id": "node-id-2",
      "safe_to_shutdown": true,
      "data_loss_on_removal": false,
      "decommission_possible": true,
      "shards_remaining": 0,
      "time_since_decommission": "1.2h",
      "time_since_decommission_millis": 4320000
    }
  ]
}

These fields are all made up and are just ideas, but here's how they'd work:
- safe_to_shutdown - whether the node contains any shards; if it holds 0, this would be true
- data_loss_on_removal - in some cases (2-node clusters) you cannot evacuate all the data, because primaries and replicas can't be on the same node; this could return true if there were any 0-replica indices present on the node (so you could still remove the node if you didn't mind your cluster going yellow)
- decommission_possible - internally ES could check whether the shards could even be moved off of the node (maybe they can't due to disk space, for instance), and return false if it's never going to be possible to decommission this node
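Until something like this exists, the closest stand-in for shards_remaining and safe_to_shutdown is filtering the allocation cat API to the node in question (the node name is illustrative) and waiting for its shard count to reach 0:
GET /_cat/allocation/data-node-1?v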
Additionally, we'd want a way to recommission a node in the event that it was marked for decommission but an admin changed their mind:
PUT /_cluster/allocation/recommission
{
"nodes": ["node-id-2"]
}There's a lot of additional ideas around this (potential things like "as soon as a node has moved all data off, automatically shut itself down"), but this is a bare bones issue to get discussion started.
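For comparison, the closest present-day equivalent of recommissioning is simply clearing the exclusion setting again:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._id": null
  }
}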