NTH should issue lifecycle heartbeats #493

@jrsherry

Description

I've been using the NTH in queue processor mode. This implementation uses a lifecycle hook associated with the node instance to trigger the NTH to cordon and drain the node. Lifecycle hooks support two timeouts: the global timeout (max 48 hours) and the heartbeat timeout (max 7200 seconds).
https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_LifecycleHook.html
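
To make the relationship between the two timeouts concrete, here is a rough sketch of the arithmetic. It assumes the limits as I understand them from the AWS documentation: the heartbeat timeout must be between 30 and 7200 seconds, and the total wait is capped at 48 hours or 100 times the heartbeat timeout, whichever is smaller. The function name `max_wait_seconds` is made up for illustration.

```python
# Sketch only: limits below are taken from the AWS lifecycle hook docs
# as I understand them, not from the NTH code base.
MIN_HEARTBEAT_SECONDS = 30
MAX_HEARTBEAT_SECONDS = 7200
GLOBAL_CAP_SECONDS = 48 * 60 * 60  # 48 hours


def max_wait_seconds(heartbeat_timeout: int) -> int:
    """Longest time an instance can stay in the wait state, in seconds,
    provided heartbeats keep being recorded (hypothetical helper)."""
    if not MIN_HEARTBEAT_SECONDS <= heartbeat_timeout <= MAX_HEARTBEAT_SECONDS:
        raise ValueError("heartbeat timeout must be between 30 and 7200 seconds")
    # The cap is 48 hours or 100x the heartbeat timeout, whichever is smaller.
    return min(GLOBAL_CAP_SECONDS, 100 * heartbeat_timeout)


print(max_wait_seconds(7200))  # 172800 (the full 48 hours)
print(max_wait_seconds(300))   # 30000 (about 8.3 hours)
```

Without any heartbeats, the wait ends after a single heartbeat timeout, so only the first argument matters in that case.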

This means that if the NTH doesn't issue lifecycle heartbeats during the draining process, the node will be terminated (assuming CONTINUE on timeout rather than ABANDON) as soon as the hook's heartbeat timeout elapses, which is at most 7200 seconds.

This is problematic if you've got termination grace periods that can exceed 7200 seconds. The node will be terminated before its pods can be evicted safely.

If the NTH issued lifecycle heartbeats during the node drain, it would effectively support grace periods of up to the 48-hour global timeout.
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/record-lifecycle-action-heartbeat.html
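
As a rough illustration of what I'm asking for (not how the NTH is implemented, and in Python rather than the NTH's own language), the heartbeat loop could look something like the sketch below. It assumes a boto3-style Auto Scaling client exposing `record_lifecycle_action_heartbeat`; the function name `send_heartbeats_until_drained` and the `drain_done` event are hypothetical names for this example.

```python
import threading


def send_heartbeats_until_drained(autoscaling, hook_name, asg_name,
                                  instance_id, drain_done, interval_seconds):
    """Record a lifecycle heartbeat every interval_seconds until the
    drain_done event is set. Returns the number of heartbeats sent.

    autoscaling is assumed to be a boto3-style Auto Scaling client,
    e.g. boto3.client("autoscaling"); drain_done is a threading.Event
    that the (hypothetical) drain routine sets when it finishes.
    """
    sent = 0
    # Event.wait returns False when the interval elapses with the drain
    # still running, and True as soon as the drain has finished.
    while not drain_done.wait(timeout=interval_seconds):
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=asg_name,
            InstanceId=instance_id,
        )
        sent += 1
    return sent
```

The interval would need to be comfortably shorter than the hook's heartbeat timeout so that each heartbeat lands before the timer runs out. Once the drain finishes, the lifecycle action would still be completed as it is today.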
