Closed
Labels
:Core/Infra/Node Lifecycle (Node startup, bootstrapping, and shutdown), >feature, Meta, Team:Core/Infra (Meta label for core/infra team)
Description
This issue supersedes #49064, which will be closed.
The node shutdown API should provide a safe way for operators to shut down a node, ensuring that all relevant orchestration steps are taken to prevent cluster instability and data loss. The feature can be used to decommission, power cycle, or upgrade nodes.
An example of marking a node for shutdown:
PUT /_nodes/<node_id>/shutdown
{
  "type": "remove",¹
  "reason": "shutdown of node so we can remove it from the cluster"²
}
¹ The type of decommission, in this case either a "remove" (the node is never coming back) or a "restart"
² A free-text description, entered by the user, of why the node is being shut down
And retrieving the shutdown status:
GET /_nodes/<node_id>/shutdown
{
  "node": "data-node-1",
  "node_id": "node-id-1",
  "type": "remove",
  "reason": "shutdown of node so we can remove it from the cluster",
  "status": {¹
    "shutdown_status": "IN_PROGRESS",²
    "shard_migration": {
      "status": "IN_PROGRESS",
      "shard_migrations_remaining": 7,³
      "time_started": "<user readable date>",
      "time_started_millis": 234091892
    },
    "persistent_tasks": {⁴
      "status": "IN_PROGRESS",
      "tasks_remaining": 2,⁵
      "error": "ICouldntStopTheTasksException[i can't do that dave]...etc stacktrace etc...",
      "time_started": "<user readable date>",
      "time_started_millis": 128391987
    },
    "plugins": {⁶
      "status": "NOT_STARTED"
    },
    "data_loss_on_removal": false⁷
  },
  "time_since_shutdown": "1.2h",⁸
  "time_since_shutdown_millis": 4320000,
  "shutdown_started": "<user readable date>",⁹
  "shutdown_started_millis": 128391987
}
1. Shows the current state of the shutdown for this node. This can be used by operators to track progress
2. Overall shutdown status. Possible values are: "IN_PROGRESS", "COMPLETE", "STALLED". If the shutdown is STALLED, an error field will also be returned containing the reason the shutdown is stalled (e.g. no nodes can take the remaining shards)
3. How many shards remain to be migrated off of this node
4. Whether in-progress persistent tasks have been halted and new tasks have been blocked
5. The number of tasks that need to be completed before shutdown
6. Whether plugins have indicated that they are ready for shutdown
7. Whether data loss could occur if the node was terminated now
8. How long the shutdown has been ongoing.
9. When the shutdown was initiated.
Here are some high-level tasks that need to be completed for this:
- Add cluster state building blocks for tracking node shutdown status (@gwbrown) Add custom metadata to track node shutdowns #70044
- Implement full status API that reads shutdown status (@gwbrown) Integrate Node Shutdown API with cluster metadata #71162
- Add REST scaffolding and feature flag for the shutdown APIs (@dakrone) Add REST scaffolding for node shutdown API #70697
- Mechanism for migrating data away from a decommissioned node
- Allocation decider (@gwbrown) Add an allocation decider to prevent allocating shards to nodes which are preparing for shutdown #71658
- Ensure status is updated for data migration (@gwbrown) Expose shard migration status in Node Shutdown Status API #73873
- Mechanism to handle persistent tasks
- Ensure persistent tasks are not assigned to nodes shutting down (@dakrone) Don't assign persistent tasks to nodes shutting down #72260
- Mechanism for a node being restarted to retain its data (@gwbrown) Delay shard reassignment from nodes which are known to be restarting #75606
- Method to avoid needing to stop ILM (@dakrone) Make ILM aware of node shutdown #73690
- Check within the plugin lifecycle for the safety of shutdown (@dakrone) Make ILM aware of node shutdown #73690
- Update ML to make use of the ShutdownAwarePlugin and stop its work while shutting down
- Convert system property feature flag into yml setting that cannot be enabled on a non-release build (@dakrone) Convert node shutdown system property feature flag to setting #74267
- Remove feature flag (when ready for release) (@gwbrown) Remove Node Shutdown API feature flag #76588
- Flip feature flag to default to "true" for snapshot builds (@dakrone) Flip node shutdown feature flag to default to true on snapshot builds #75962
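The allocation decider and persistent task items above share one idea: consult the per-node shutdown marker before placing new work on a node. The following is a minimal Python sketch of that idea, not the Elasticsearch implementation. `ShutdownMarker`, `can_allocate`, and `assignable_nodes` are hypothetical names, and the choice to refuse shard allocation only for "remove" markers (while still allowing it for "restart") is an assumption that the issue text does not settle.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class ShutdownType(Enum):
    REMOVE = "remove"    # the node is never coming back
    RESTART = "restart"  # the node is expected to return


class Decision(Enum):
    YES = "yes"
    NO = "no"


@dataclass(frozen=True)
class ShutdownMarker:
    """Hypothetical stand-in for the per-node shutdown metadata in cluster state."""
    node_id: str
    type: ShutdownType
    reason: str


def can_allocate(node_id: str, shutdowns: Dict[str, ShutdownMarker]) -> Decision:
    """Decide whether a shard may be allocated to the given node.

    Assumption: only nodes marked for removal refuse new shards; a node
    that is merely restarting keeps accepting them.
    """
    marker = shutdowns.get(node_id)
    if marker is not None and marker.type is ShutdownType.REMOVE:
        return Decision.NO
    return Decision.YES


def assignable_nodes(candidates: List[str],
                     shutdowns: Dict[str, ShutdownMarker]) -> List[str]:
    """Return the candidates that may receive a persistent task.

    Any node with a shutdown marker is skipped, regardless of type.
    """
    return [node_id for node_id in candidates if node_id not in shutdowns]
```

In this sketch a node with no marker is treated as a normal allocation target, so registering a shutdown is the only thing that changes placement decisions.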
Phase 2:
- Add "REPLACE" shutdown type
- Upgrades to persistent task handling
- Enhance data tier allocation decider to allow migrating to a different tier if all nodes in a certain tier are shut down (possibly?)
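For operators consuming the status response described earlier in this issue, the check before terminating a node could look roughly like the Python sketch below. It only reads fields that appear in the example response. The function names are hypothetical, and it assumes that the per-step "status" fields use "COMPLETE" the same way the overall "shutdown_status" does, which the example does not confirm.

```python
from typing import Any, Dict, List

# Step names taken from the example status response.
STEPS = ("shard_migration", "persistent_tasks", "plugins")


def pending_steps(response: Dict[str, Any]) -> List[str]:
    """List the shutdown steps that have not reported completion.

    Assumption: a finished step reports "status": "COMPLETE".
    """
    status = response.get("status", {})
    return [
        step for step in STEPS
        if status.get(step, {}).get("status") != "COMPLETE"
    ]


def safe_to_terminate(response: Dict[str, Any]) -> bool:
    """True only if the shutdown is complete and no data loss is flagged."""
    status = response.get("status", {})
    return (
        status.get("shutdown_status") == "COMPLETE"
        and status.get("data_loss_on_removal") is False
    )
```

A missing or malformed "status" object is treated as not safe, so the check fails closed rather than open.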