Node on "degraded hardware" not evacuated & terminated ASAP #361

@tyrken

Description

I took advantage of the rare incidence of a scheduled event on a node in our test EKS cluster to update to the latest NTH (1.12.0 via the latest Helm chart), still running in IMDS mode. Unfortunately it didn't start draining, despite the notice below, which I was able to capture manually:

$ curl http://169.254.169.254/latest/meta-data/events/maintenance/scheduled
[ {
  "NotBefore" : "25 Feb 2021 02:00:00 GMT",
  "Code" : "instance-stop",
  "Description" : "The instance is running on degraded hardware",
  "EventId" : "instance-event-087e363178449a61e",
  "State" : "active"
} ]

Helm settings (set via helmfile, but the mapping to plain Helm values should be clear):

repositories:
  - name: aws-eks-charts
    url: https://aws.github.io/eks-charts
releases:
  - name: aws-node-termination-handler
    namespace: kube-system
    chart: aws-eks-charts/aws-node-termination-handler
    version: 0.13.3
    values:
      - enableScheduledEventDraining: true
        enableSpotInterruptionDraining: true
        enableRebalanceMonitoring: false
        taintNode: true
        webhookURL: <redacted>
        enablePrometheusServer: true
        nodeSelector:
          lifecycle: spot
        podMonitor:
          create: true
        jsonLogging: true
        logLevel: DEBUG

DaemonSet pod logs:

{"time":"2021-02-11T09:35:48Z","message":"Trying to get token from IMDSv2"}
{"level":"info","time":"2021-02-11T09:35:48Z","message":"Starting to serve handler /metrics, port 9092"}
{"time":"2021-02-11T09:35:48Z","message":"Got token from IMDSv2"}
{"metadata":{"accountId":"XXXXXXXX","instanceId":"i-0415e8c43a18dccc7","instanceType":"t2.medium","publicHostname":"ec2-54-229-104-67.eu-west-1.compute.amazonaws.com","publicIp":"54.229.104.67","localHostname":"ip-172-31-13-215.empire.lyst","privateIp":"172.31.13.215","availabilityZone":"eu-west-1b","region":"eu-west-1"},"time":"2021-02-11T09:35:48Z","message":"Startup Metadata Retrieved"}
{"dry_run":false,"node_name":"ip-172-31-13-215.eu-west-1.compute.internal","metadata_url":"http://169.254.169.254","kubernetes_service_host":"10.100.0.1","kubernetes_service_port":"443","delete_local_data":true,"ignore_daemon_sets":true,"pod_termination_grace_period":-1,"node_termination_grace_period":120,"enable_scheduled_event_draining":true,"enable_spot_interruption_draining":true,"enable_sqs_termination_draining":false,"enable_rebalance_monitoring":false,"metadata_tries":3,"cordon_only":false,"taint_node":true,"json_logging":true,"log_level":"DEBUG","webhook_proxy":"","uptime_from_file":"","enable_prometheus_server":true,"prometheus_server_port":9092,"aws_region":"eu-west-1","aws_endpoint":"","queue_url":"","check_asg_tag_before_draining":true,"ManagedAsgTag":"aws-node-termination-handler/managed","time":"2021-02-11T09:35:48Z","message":"aws-node-termination-handler arguments"}
{"level":"warn","time":"2021-02-11T09:35:48Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:50Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:52Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:54Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"time":"2021-02-11T09:35:56Z","message":"Started watching for interruption events"}
{"time":"2021-02-11T09:35:56Z","message":"Kubernetes AWS Node Termination Handler has started successfully!"}
{"time":"2021-02-11T09:35:56Z","message":"Started watching for event cancellations"}
{"event_type":"SCHEDULED_EVENT","time":"2021-02-11T09:35:56Z","message":"Started monitoring for events"}
{"event_type":"SPOT_ITN","time":"2021-02-11T09:35:56Z","message":"Started monitoring for events"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:35:58Z","message":"Unable to parse scheduled event end time, continuing"}
{"event":{"EventID":"instance-event-087e363178449a61e","Kind":"SCHEDULED_EVENT","Description":"instance-stop will occur between 25 Feb 2021 02:00:00 GMT and  because The instance is running on degraded hardware\n","State":"active","NodeName":"ip-172-31-13-215.eu-west-1.compute.internal","InstanceID":"","StartTime":"2021-02-25T02:00:00Z","EndTime":"2021-02-25T02:00:00Z","Drained":false,"InProgress":false},"time":"2021-02-11T09:35:58Z","message":"Adding new event to the event store"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:00Z","message":"Unable to parse scheduled event end time, continuing"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:02Z","message":"Unable to parse scheduled event end time, continuing"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:04Z","message":"Unable to parse scheduled event end time, continuing"}
... <repeats for ever>

I haven't read through the code enough yet, but I'm logging the above before I lose it, and I'm confused so far: the repeated parsing error seems to come only from decoding the raw JSON response, which looks correctly formatted as far as I can see. And the "Adding new event to the event store" line with the EventID shows the event itself was parsed correctly - so what is the parsing routine still complaining about, and why has it not drained anything yet?

Also, the "DEBUG" log level doesn't seem to provide any extra information, which is a little unfortunate. I had originally deployed with the default "info" level, but from memory the output looked similar to the above...

Metadata

Labels

- Priority: Low - This issue will not be seen by most users; it is a very specific use case or corner case
- Type: Enhancement - New feature or request
- stalebot-ignore - To NOT let the stalebot update or close the Issue / PR
