Node on "degraded hardware" not evacuated & terminated ASAP #361

@tyrken

Description

I took advantage of the rare incidence of a scheduled event on a node in our test EKS cluster to update to the latest NTH (1.12.0 via the latest Helm chart), still running in IMDS mode. Unfortunately it didn't start draining, despite the notice below, which I was able to capture manually:

$ curl http://169.254.169.254/latest/meta-data/events/maintenance/scheduled
[ {
  "NotBefore" : "25 Feb 2021 02:00:00 GMT",
  "Code" : "instance-stop",
  "Description" : "The instance is running on degraded hardware",
  "EventId" : "instance-event-087e363178449a61e",
  "State" : "active"
} ]

Helm settings (set via helmfile, but the mapping to plain Helm values should be clear):

repositories:
  - name: aws-eks-charts
    url: https://aws.github.io/eks-charts
releases:
  - name: aws-node-termination-handler
    namespace: kube-system
    chart: aws-eks-charts/aws-node-termination-handler
    version: 0.13.3
    values:
      - enableScheduledEventDraining: true
        enableSpotInterruptionDraining: true
        enableRebalanceMonitoring: false
        taintNode: true
        webhookURL: <redacted>
        enablePrometheusServer: true
        nodeSelector:
          lifecycle: spot
        podMonitor:
          create: true
        jsonLogging: true
        logLevel: DEBUG

DaemonSet pod logs:

{"time":"2021-02-11T09:35:48Z","message":"Trying to get token from IMDSv2"}
{"level":"info","time":"2021-02-11T09:35:48Z","message":"Starting to serve handler /metrics, port 9092"}
{"time":"2021-02-11T09:35:48Z","message":"Got token from IMDSv2"}
{"metadata":{"accountId":"XXXXXXXX","instanceId":"i-0415e8c43a18dccc7","instanceType":"t2.medium","publicHostname":"ec2-54-229-104-67.eu-west-1.compute.amazonaws.com","publicIp":"54.229.104.67","localHostname":"ip-172-31-13-215.empire.lyst","privateIp":"172.31.13.215","availabilityZone":"eu-west-1b","region":"eu-west-1"},"time":"2021-02-11T09:35:48Z","message":"Startup Metadata Retrieved"}
{"dry_run":false,"node_name":"ip-172-31-13-215.eu-west-1.compute.internal","metadata_url":"http://169.254.169.254","kubernetes_service_host":"10.100.0.1","kubernetes_service_port":"443","delete_local_data":true,"ignore_daemon_sets":true,"pod_termination_grace_period":-1,"node_termination_grace_period":120,"enable_scheduled_event_draining":true,"enable_spot_interruption_draining":true,"enable_sqs_termination_draining":false,"enable_rebalance_monitoring":false,"metadata_tries":3,"cordon_only":false,"taint_node":true,"json_logging":true,"log_level":"DEBUG","webhook_proxy":"","uptime_from_file":"","enable_prometheus_server":true,"prometheus_server_port":9092,"aws_region":"eu-west-1","aws_endpoint":"","queue_url":"","check_asg_tag_before_draining":true,"ManagedAsgTag":"aws-node-termination-handler/managed","time":"2021-02-11T09:35:48Z","message":"aws-node-termination-handler arguments"}
{"level":"warn","time":"2021-02-11T09:35:48Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:50Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:52Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"level":"warn","time":"2021-02-11T09:35:54Z","message":"Error when trying to list Nodes w/ label, falling back to direct Get lookup of node"}
{"time":"2021-02-11T09:35:56Z","message":"Started watching for interruption events"}
{"time":"2021-02-11T09:35:56Z","message":"Kubernetes AWS Node Termination Handler has started successfully!"}
{"time":"2021-02-11T09:35:56Z","message":"Started watching for event cancellations"}
{"event_type":"SCHEDULED_EVENT","time":"2021-02-11T09:35:56Z","message":"Started monitoring for events"}
{"event_type":"SPOT_ITN","time":"2021-02-11T09:35:56Z","message":"Started monitoring for events"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:35:58Z","message":"Unable to parse scheduled event end time, continuing"}
{"event":{"EventID":"instance-event-087e363178449a61e","Kind":"SCHEDULED_EVENT","Description":"instance-stop will occur between 25 Feb 2021 02:00:00 GMT and  because The instance is running on degraded hardware\n","State":"active","NodeName":"ip-172-31-13-215.eu-west-1.compute.internal","InstanceID":"","StartTime":"2021-02-25T02:00:00Z","EndTime":"2021-02-25T02:00:00Z","Drained":false,"InProgress":false},"time":"2021-02-11T09:35:58Z","message":"Adding new event to the event store"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:00Z","message":"Unable to parse scheduled event end time, continuing"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:02Z","message":"Unable to parse scheduled event end time, continuing"}
{"error":"parsing time \"\" as \"2 Jan 2006 15:04:05 GMT\": cannot parse \"\" as \"2\"","time":"2021-02-11T09:36:04Z","message":"Unable to parse scheduled event end time, continuing"}
... <repeats for ever>

I haven't read through the code enough yet, but I'm logging the above before I lose it, and I'm confused so far: the repeated parsing error seems to come only from decoding the raw JSON response, which looks correctly formatted as far as I can see. And the "Adding new event to the event store" line with the EventID shows the event itself was parsed correctly - so what is the parsing routine still complaining about, and why has it not drained anything yet?

Also, the "DEBUG" log level doesn't seem to provide any extra information, which is a little unfortunate. I had originally deployed with the default "info" level, but from memory the output looked similar to the above...

Metadata

Labels

- Priority: Low - This issue will not be seen by most users; it is a very specific use case or corner case
- Type: Enhancement - New feature or request
- stalebot-ignore - To NOT let the stalebot update or close the Issue / PR
