Description
I've been using the NTH in queue processor mode. This mode relies on a lifecycle hook associated with the node instance to trigger the NTH to cordon and drain the node. Lifecycle hooks support two timeouts: the global timeout (max 48 hours) and the heartbeat timeout (max 7200 seconds).
https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_LifecycleHook.html
This means that if the NTH doesn't issue lifecycle heartbeats during the draining process, the node will be terminated as soon as the hook's heartbeat timeout expires, which is at most 7200 seconds (assuming the hook's default result is CONTINUE rather than ABANDON).
This is problematic if you've got termination grace periods that can exceed 7200 seconds. The node will be terminated before its pods can be safely evicted.
If the NTH issued lifecycle heartbeats during the node drain, it would effectively support grace periods up to the 48 hour global timeout.
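To make the arithmetic concrete, here is a rough sketch of the two windows being compared. The function names are made up for illustration. The "100 times the heartbeat timeout" cap on the global timeout is my reading of the LifecycleHook API reference linked above, so treat it as an assumption and check it against the docs.

```python
MAX_HEARTBEAT_TIMEOUT_S = 7200        # upper bound for a hook's heartbeat timeout
MAX_GLOBAL_TIMEOUT_S = 48 * 60 * 60   # 172800 seconds


def global_timeout(heartbeat_timeout_s: int) -> int:
    """Longest time an instance can stay in the wait state.

    Assumption: the cap is 48 hours or 100x the heartbeat timeout,
    whichever is smaller.
    """
    return min(MAX_GLOBAL_TIMEOUT_S, 100 * heartbeat_timeout_s)


def drain_fits(grace_period_s: int, heartbeat_timeout_s: int, heartbeats_sent: bool) -> bool:
    """True if a pod with this grace period can finish before the hook expires."""
    if heartbeats_sent:
        # Heartbeats keep resetting the timer until the global timeout is hit.
        window = global_timeout(heartbeat_timeout_s)
    else:
        # Without heartbeats the hook expires after a single heartbeat timeout.
        window = heartbeat_timeout_s
    return grace_period_s <= window


three_hours = 3 * 60 * 60
print(drain_fits(three_hours, MAX_HEARTBEAT_TIMEOUT_S, heartbeats_sent=False))
print(drain_fits(three_hours, MAX_HEARTBEAT_TIMEOUT_S, heartbeats_sent=True))
```

With a 3 hour grace period and the maximum heartbeat timeout, the first call reports that the drain does not fit and the second reports that it does.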
https://docs.aws.amazon.com/cli/latest/reference/autoscaling/record-lifecycle-action-heartbeat.html
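For illustration only, here is a minimal sketch of what I have in mind, written in Python rather than as a patch to the NTH. It assumes a boto3-style Auto Scaling client (anything exposing `record_lifecycle_action_heartbeat`). The `drain` callable, the `hook` dict and its keys are hypothetical placeholders and do not reflect how the NTH is actually structured.

```python
import threading
from typing import Callable


def drain_with_heartbeats(
    autoscaling,                  # e.g. boto3.client("autoscaling")
    hook: dict,                   # hypothetical keys: asg_name, hook_name, instance_id
    drain: Callable[[], None],    # hypothetical: blocks until the node is drained
    interval_s: float,
) -> int:
    """Run drain() while a background thread keeps the lifecycle hook alive.

    Returns the number of heartbeats that were sent.
    """
    done = threading.Event()
    sent = 0

    def beat() -> None:
        nonlocal sent
        # Event.wait returns False on timeout (drain still running)
        # and True once the drain has finished.
        while not done.wait(interval_s):
            autoscaling.record_lifecycle_action_heartbeat(
                AutoScalingGroupName=hook["asg_name"],
                LifecycleHookName=hook["hook_name"],
                InstanceId=hook["instance_id"],
            )
            sent += 1

    worker = threading.Thread(target=beat, daemon=True)
    worker.start()
    try:
        drain()
    finally:
        # Stop the heartbeats whether the drain succeeded or raised.
        done.set()
        worker.join()
    return sent
```

The interval would need to sit comfortably below the hook's heartbeat timeout (half of it, say) so that a delayed call doesn't let the hook expire. The global timeout still applies no matter how many heartbeats are sent.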