94 changes: 94 additions & 0 deletions designs/v2.md
@@ -0,0 +1,94 @@
# AWS Node Termination Handler v2
*Authors: bwagner5@*

---
Definitions

* NTH - (AWS) Node Termination Handler
* IMDS - (EC2) Instance Metadata Service
* k8s - Kubernetes
* CRD - Custom Resource Definition
* Spot ITN - Spot Interruption Termination Event
> **Reviewer (Contributor):** nit: v1 - Current/existing implementation of NTH

---

## Mission Statement

Simplify termination handling of nodes in Kubernetes clusters.

## Background

NTH has evolved from a simple k8s DaemonSet, now known as IMDS mode, to its current
iteration: a Deployment that acts as a queue consumer of EventBridge interruption
events. There are pros and cons to each mode, but in general IMDS
mode is simple to run and handles Spot ITNs. The queue-processor mode of NTH is
more difficult to set up, but is also more powerful since it is able to handle ASG
termination events, EC2 instance status changes, Spot Rebalance Recommendations, Spot ITNs, and more.

## Goals

* Simplify the runtime configuration of NTH
* Support for custom drain events
* Custom, configurable actions in response to drain events
* Integration w/ provisioners like Karpenter

## Overview

NTH Queue-Processor mode has accumulated a wealth of configuration options, which can make it cumbersome to install.
Configuration options and different modes of operation have caused the Helm chart to become large and complex.

> **Reviewer (Contributor):** nit: example of helm installation step with a realistic config of nth
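To illustrate, a realistic v1 queue-processor installation already requires several `--set` flags (the chart and flag names below reflect the v1 Helm chart; the queue URL, account ID, and role ARN are placeholders):

```
helm repo add eks https://aws.github.io/eks-charts

helm upgrade --install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSqsTerminationDraining=true \
  --set queueURL=https://sqs.us-east-1.amazonaws.com/123456789012/my-nth-queue \
  --set awsRegion=us-east-1 \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::123456789012:role/nth-role
```

And this is before configuring webhooks, per-event toggles, or drain behavior, each of which adds more values.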

NTH solves two problems with one tool:
1. Termination Handling (cordon + drain)
2. Interruption Handling (Spot ITNs, ASG Scale-In, etc.).

**Termination Handling** is a fairly simple operation that NTH should handle well, while allowing its logic to be plugged into other systems that require
out-of-band termination handling.
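For reference, cordon + drain is the same pair of operations `kubectl` exposes directly; the node name below is a placeholder, and the flags mirror the drain configuration shown later in this doc:

```
# Mark the node unschedulable so no new pods land on it
kubectl cordon ip-10-0-1-23.ec2.internal

# Evict pods, respecting PodDisruptionBudgets
kubectl drain ip-10-0-1-23.ec2.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=30
```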

**Interruption Handling** in v1 only works with the subset of interruption events that are provided by EventBridge or IMDS.
With NTH v2, custom interruption events should be possible via a labeling mechanism or a custom event monitor.
In addition, the actions taken by the interruption monitor should be customizable within the CRD.

## The Terminator CRD

To support an extensible system, NTH v2 will use a CRD called a `Terminator` which will represent a logical Node Termination Handler.
The Terminator CRD will define which nodes it is in charge of, which interruption events it will monitor, and what actions to take when
an interruption event is observed.

NTH v2 will not use Helm for configuration of termination options and will instead use the Terminator CRD. This will allow for easier tuning of the
runtime configuration, as well as the ability to create logical Terminators rather than deploying NTH multiple times, which wastes compute resources
and is tedious to manage.

Logical Terminators allow users to customize termination of specific node groups or provisioner-managed nodes using a node selector. For example, a
logical Terminator could be configured to respond to events on nodes with the label `training-group-1`. Nodes in `training-group-1` may need some
extra time to drain, which can be configured on the Terminator resource via the `pod-termination-grace-period` and `node-termination-grace-period` settings.
Other groups could use a second logical Terminator resource with more aggressive grace periods.
> **Reviewer (Contributor):** Could be useful to show both of these configurations in the example below, to make it more clear how the different node groups are selected and handled.


### Example Terminator:

```
apiVersion: terminator.k8s.aws/v1alpha1
kind: Terminator
metadata:
  name: default
spec:
  matchLabels:
    node.k8s.aws/capacity-type: spot
  queueURL: https://sqs.us-east-42.amazonaws.com/my-nth-q
  events:
    SPOT_ITN: ["CORDON", "DRAIN"]
    SPOT_REBALANCE: ["NO_ACTION"]
    ASG_TERMINATION: ["CORDON", "DRAIN"]
    CUSTOM_EVENT1: ["tag:terminate-me"]
  drainConfig:
    ignoreDaemonSets: true
    deleteLocalData: true
    podTerminationGracePeriod: 30
    nodeTerminationGracePeriod: 90
  webhook:
    url: mywebhook.example.com
    proxy: myproxy.example.com
    headers:
      - content-type: application/json
    template: |
      {"text":"[NTH][Instance Interruption] EventID: {{ .EventID }} - Kind: {{ .Kind }} - Instance: {{ .InstanceID }} - Node: {{ .NodeName }} - Description: {{ .Description }} - Start Time: {{ .StartTime }}"}
...
```

> **Reviewer comment (on lines +77 to +79):** Why would a user want to configure these? Are there sensible defaults they can use instead?
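To make the node-group selection concrete, a second, hypothetical Terminator for the `training-group-1` nodes described above could use the same schema with more generous grace periods (the `node-group` label key and the grace-period values are illustrative, not a confirmed API):

```
apiVersion: terminator.k8s.aws/v1alpha1
kind: Terminator
metadata:
  name: training-group-1
spec:
  matchLabels:
    node-group: training-group-1
  queueURL: https://sqs.us-east-42.amazonaws.com/my-nth-q
  events:
    SPOT_ITN: ["CORDON", "DRAIN"]
  drainConfig:
    ignoreDaemonSets: true
    deleteLocalData: true
    # longer grace periods for slow-draining training workloads
    podTerminationGracePeriod: 300
    nodeTerminationGracePeriod: 600
...
```

Both Terminators run inside a single NTH deployment; each node is handled by whichever Terminator's `matchLabels` selects it.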
> **Reviewer (Contributor):** related to comment above, a comparison of an equivalent QP/IMDS NTH config to this would make some of the benefits clearer
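The webhook `template` field in the example uses Go `text/template` syntax. As a sketch of how such a template would be rendered (the `event` struct and its fields are inferred from the template's placeholders, not a confirmed NTH v2 type):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// event holds the fields referenced by the webhook template.
// Field names are assumed from the example template above.
type event struct {
	EventID     string
	Kind        string
	InstanceID  string
	NodeName    string
	Description string
	StartTime   string
}

const webhookTemplate = `{"text":"[NTH][Instance Interruption] EventID: {{ .EventID }} - Kind: {{ .Kind }} - Instance: {{ .InstanceID }} - Node: {{ .NodeName }} - Description: {{ .Description }} - Start Time: {{ .StartTime }}"}`

// renderWebhook fills the template with a concrete event.
func renderWebhook(e event) (string, error) {
	t, err := template.New("webhook").Parse(webhookTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, e); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	body, err := renderWebhook(event{
		EventID:     "evt-1",
		Kind:        "SPOT_ITN",
		InstanceID:  "i-0abc",
		NodeName:    "node-1",
		Description: "spot interruption",
		StartTime:   "2023-01-01T00:00:00Z",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```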