Description
I took advantage of the rare incidence of a scheduled event on a node in our test EKS cluster to update to the latest NTH (1.12.0 via the latest Helm chart), still running in IMDS mode. Unfortunately it didn't start draining, despite the notice below, which I was able to capture manually:
$ curl http://169.254.169.254/latest/meta-data/events/maintenance/scheduled
[ {
"NotBefore" : "25 Feb 2021 02:00:00 GMT",
"Code" : "instance-stop",
"Description" : "The instance is running on degraded hardware",
"EventId" : "instance-event-087e363178449a61e",
"State" : "active"
} ]

Helm settings (set via helmfile, but I suspect you understand):
repositories:
  - name: aws-eks-charts
    url: https://aws.github.io/eks-charts
releases:
  - name: aws-node-termination-handler
    namespace: kube-system
    chart: aws-eks-charts/aws-node-termination-handler
    version: 0.13.3
    values:
      - enableScheduledEventDraining: true
        enableSpotInterruptionDraining: true
        enableRebalanceMonitoring: false
        taintNode: true
        webhookURL: <redacted>
        enablePrometheusServer: true
        nodeSelector:
          lifecycle: spot
        podMonitor:
          create: true
        jsonLogging: true
        logLevel: DEBUG

daemonset pod logs:
{"time":"2021-02-11T09:35:48Z","message":"Trying to get token from IMDSv2"}
{"level":"info","time":"2021-02-11T09:35:48Z","message":"Starting to serve handler /metrics, port 9092"}
{"time":"2021-02-11T09:35:48Z","message":"Got token from IMDSv2"}
{"metadata":{"accountId":"XXXXXXXX","instanceId":"i-0415e8c43a18dccc7","instanceType":"t2.medium","publicHostname":"ec2-54-229-104-67.eu-west-1.compute.amazonaws.com","publicIp":"54.229.104.67","localHostname":"ip-172-31-13-215.empire.lyst","privateIp":"172.31.13.215","availabilityZone":"eu-west-1b","region":"eu-west-1"},"time":"2021-02-11T09:35:48Z","message":"Startup Metadata Retrieved"}
{"dry_run":false,"node_name":"ip-172-31-13-215.eu-west-1.compute.internal","metadata_url":"http://169.254.169.254","kubernetes_service_host":"10.100.0.1","kubernetes_service_port":"443","delete_local_data":true,"ignore_daemon_sets":true,"pod_termination_grace_period":-1,"node_termination_grace_period":120,"enable_scheduled_event_draining":true,"enable_spot_interruption_draining":true,"enable_sqs_termination_draining":false,"enable_rebalance_monitoring":false,"metadata_tries":3,"cordon_only":false,"taint_node":true,"json_logging":true,"log_level":"DEBUG","webhook_proxy":"","uptime_from_file":"","enable_prometheus_server":true,"prometheus_server_port":9092,"aws_region":"eu-west-1","aws_endpoint":"","queue_url":"","check_asg_tag_before_draining":true,"ManagedAsgTag":"aws-node-termination-handler/managed","time":"2021-02-11T09:35:48Z","message":"aws-node-termination-handler arguments"}
{"level":"warn","time":"2021-02-11T09:35:48Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:50Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:52Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:54Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"time":"2021-02-11T09:35:56Z","message":"Started watching for interruption events"}
{"time":"2021-02-11T09:35:56Z","message":"Kubernetes AWS Node Termination Handler has started successfully!"}
{"time":"2021-02-11T09:35:56Z","message":"Started watching for event cancellations"}
{"event_type":"SCHEDULED_EVENT","time":"2021-02-11T09:35:56Z","message":"Started monitoring for events"}
{"event_type":"SPOT_ITN","time":"2021-02-11T09:35:56Z","message":"Started monitoring for events"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:35:58Z","message":"Unable to parse scheduled event end time, continuing"}
{"event":{"EventID":"instance-event-087e363178449a61e","Kind":"SCHEDULED_EVENT","Description":"instance-stop will occur between 25 Feb 2021 02:00:00 GMT and because The instance is running on degraded hardware\n","State":"active","NodeName":"ip-172-31-13-215.eu-west-1.compute.internal","InstanceID":"","StartTime":"2021-02-25T02:00:00Z","EndTime":"2021-02-25T02:00:00Z","Drained":false,"InProgress":false},"time":"2021-02-11T09:35:58Z","message":"Adding new event to the event store"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:00Z","message":"Unable to parse scheduled event end time, continuing"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:02Z","message":"Unable to parse scheduled event end time, continuing"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:04Z","message":"Unable to parse scheduled event end time, continuing"}
... <repeats forever>
I haven't read through the code enough yet, but I'm logging the above before I lose it, and I'm confused so far: the repeated parsing error seems to come only from decoding the raw JSON response, which looks correctly formatted as far as I can see. On top of that, the one line with the EventID looks to have parsed it correctly - so what is the parsing routine still complaining about, and why hasn't it done anything yet?
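For what it's worth, here is my current guess, written up as a minimal standalone sketch rather than the handler's actual code: the captured IMDS payload has no "NotAfter" key at all, so if the handler feeds that empty string to Go's time.Parse with the layout quoted in the error, it reproduces the message exactly while NotBefore still parses fine (which might also explain why the stored event's EndTime equals its StartTime):

```go
// Hypothetical reproduction of the log error above; the field names and the
// layout string are taken from the captured payload and the error message,
// not from the handler's source.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// scheduledEvent mirrors the IMDS maintenance event fields; note that the
// payload I captured has no "NotAfter" key, so NotAfter stays "".
type scheduledEvent struct {
	NotBefore   string `json:"NotBefore"`
	NotAfter    string `json:"NotAfter"`
	Code        string `json:"Code"`
	Description string `json:"Description"`
	EventID     string `json:"EventId"`
	State       string `json:"State"`
}

// Layout quoted verbatim in the error message.
const scheduledEventDateFormat = "2 Jan 2006 15:04:05 GMT"

func main() {
	raw := `[ {
  "NotBefore" : "25 Feb 2021 02:00:00 GMT",
  "Code" : "instance-stop",
  "Description" : "The instance is running on degraded hardware",
  "EventId" : "instance-event-087e363178449a61e",
  "State" : "active"
} ]`

	var events []scheduledEvent
	if err := json.Unmarshal([]byte(raw), &events); err != nil {
		panic(err)
	}

	// NotBefore parses cleanly; prints the parsed time and <nil>.
	start, err := time.Parse(scheduledEventDateFormat, events[0].NotBefore)
	fmt.Println(start, err)

	// NotAfter is "" because the key is absent, so this fails with the same
	// error as in the log:
	// parsing time "" as "2 Jan 2006 15:04:05 GMT": cannot parse "" as "2"
	end, err := time.Parse(scheduledEventDateFormat, events[0].NotAfter)
	fmt.Println(end, err)
}
```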
Also "DEBUG" loglevel doesn't seem to have provide any extra information, which is a little unfortunate. I had originally deployed it with default "info" but from memory it looked similar to the above...