diff --git a/designs/v2.md b/designs/v2.md
new file mode 100644
index 00000000..99487089
--- /dev/null
+++ b/designs/v2.md
@@ -0,0 +1,94 @@
+# AWS Node Termination Handler v2
+*Authors: bwagner5@*
+
+---
+Definitions
+
+* NTH — (AWS) Node Termination Handler
+* IMDS — (EC2) Instance MetaData Service
+* k8s — Kubernetes
+* CRD - Custom Resource Definitions
+* Spot ITN - Spot Interruption Termination Event
+---
+
+## Mission Statement
+
+Simplify termination handling of nodes in Kubernetes clusters.
+
+## Background
+
+NTH has evolved from a simple k8s DaemonSet, now known as IMDS mode, to its current
+iteration of a Deployment that acts as a queue consumer of EventBridge interruption
+events. Each mode has pros and cons, but in general IMDS
+mode is simple and handles Spot ITNs. The queue-processor mode of NTH is
+more difficult to set up, but is also more powerful since it is able to handle ASG
+termination events, EC2 instance status changes, Spot Rebalance Recommendations, Spot ITNs, and more.
+
+## Goals
+
+* Simplify the runtime configuration of NTH
+* Support custom drain events
+* Allow custom and configurable actions for drain events
+* Integrate with Provisioners like Karpenter
+
+## Overview
+
+NTH Queue-Processor mode has accumulated a wealth of configuration options which can make it cumbersome to install.
+Configuration options and different modes of operation have caused the Helm chart to become large and complex.
+
+NTH solves two problems with one tool:
+ 1. Termination Handling (cordon + drain)
+ 2. Interruption Handling (Spot ITNs, ASG Scale-In, etc.)
+
+**Termination Handling** is a fairly simple operation that NTH should handle well, with logic that can be plugged into other systems that require
+out-of-band termination handling.
+
+**Interruption Handling** in v1 only works with a subset of interruption events that are provided by EventBridge or IMDS.
+With NTH v2, custom interruption events should be possible via a labeling mechanism or a custom event monitor.
+In addition, the actions taken by the interruption monitor should be customizable within a CRD.
+
+## The Terminator CRD
+
+To support an extensible system, NTH v2 will use a CRD called a `Terminator` which will represent a logical Node Termination Handler.
+The Terminator CRD will define which nodes it is in charge of, which interruption events it will monitor, and what actions to take when
+an interruption event is observed.
+
+NTH v2 will not use Helm for configuration of termination options and will instead use the Terminator CRD. This will allow for easier tuning of the
+runtime configuration as well as the ability to create logical Terminators rather than deploying NTH multiple times, which can waste compute resources
+and is tedious to manage.
+
+Logical Terminators allow users to customize termination of specific node groups or provisioner-managed nodes using a node selector. For example, a
+logical Terminator could be configured to respond to events on nodes with the label `training-group-1`. Nodes in `training-group-1` may need some
+extra time to drain, which can be configured on the Terminator resource with `podTerminationGracePeriod` and `nodeTerminationGracePeriod`.
+Other groups could use a second logical Terminator resource with more aggressive grace periods.
+
+### Example Terminator:
+
+```
+apiVersion: terminator.k8s.aws/v1alpha1
+kind: Terminator
+metadata:
+  name: default
+spec:
+  matchLabels:
+    node.k8s.aws/capacity-type: spot
+  queueURL: https://sqs.us-east-42.amazonaws.com/my-nth-q
+  events:
+    SPOT_ITN: ["CORDON", "DRAIN"]
+    SPOT_REBALANCE: ["NO_ACTION"]
+    ASG_TERMINATION: ["CORDON", "DRAIN"]
+    CUSTOM_EVENT1: ["tag:terminate-me"]
+  drainConfig:
+    ignoreDaemonSets: true
+    deleteLocalData: true
+    podTerminationGracePeriod: 30
+    nodeTerminationGracePeriod: 90
+  webhook:
+    url: mywebhook.example.com
+    proxy: myproxy.example.com
+    headers:
+      - content-type: application/json
+    template: |
+      {"text":"[NTH][Instance Interruption] EventID: {{ .EventID }} - Kind: {{ .Kind }} - Instance: {{ .InstanceID }} - Node: {{ .NodeName }} - Description: {{ .Description }} - Start Time: {{ .StartTime }}"}
+...
+```
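+
+The per-group tuning described in the Terminator CRD section could be sketched as two logical Terminators, one per node group.
+This is only an illustration: the `node-group` label key, the `batch` group, and the grace-period values are assumptions, and
+fields such as `queueURL` and `events` are omitted for brevity. Only the field names are taken from the example Terminator.
+
+```yaml
+# Hypothetical sketch: a lenient Terminator for training nodes.
+apiVersion: terminator.k8s.aws/v1alpha1
+kind: Terminator
+metadata:
+  name: training-group-1
+spec:
+  matchLabels:
+    node-group: training-group-1   # assumed label key
+  drainConfig:
+    podTerminationGracePeriod: 300   # illustrative value
+    nodeTerminationGracePeriod: 600  # illustrative value
+---
+# Hypothetical sketch: a second Terminator with more aggressive grace periods.
+apiVersion: terminator.k8s.aws/v1alpha1
+kind: Terminator
+metadata:
+  name: batch
+spec:
+  matchLabels:
+    node-group: batch              # assumed label key and group name
+  drainConfig:
+    podTerminationGracePeriod: 10    # illustrative value
+    nodeTerminationGracePeriod: 30   # illustrative value
+```
+
+Both resources would be served by a single NTH deployment, which is the point of logical Terminators: per-group behavior without running NTH multiple times.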