---
title: Node shutdown taint
authors:
  - "@yastij"
owning-sig: sig-node
participating-sigs:
  - sig-storage
  - sig-cloud-provider
  - sig-scalability
approvers:
  - "@smarterclayton"
  - "@liggitt"
  - "@yujuhong"
  - "@saadali"
reviewers:
  - "@jingxu"
  - "@andrewsykim"
  - "@derekwaynecarr"
editor: Yassine Tijani
creation-date: 2019-06-26
last-updated: 2019-06-26
status: implementable
---

# Node Shutdown Taint

This includes the Summary and Motivation sections.

## Table of Contents

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [kube-controller-manager](#kube-controller-manager)
  - [Kubelet](#kubelet)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Rely on the taint as a lock mechanism (taint and untaint by the same actor)](#rely-on-the-taint-as-a-lock-mechanism-taint-and-untaint-by-the-same-actor)
  - [Rely on the taint as a lock mechanism (taint and untaint by different actors)](#rely-on-the-taint-as-a-lock-mechanism-taint-and-untaint-by-different-actors)

## Release Signoff Checklist

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria are in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

**Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those
approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).

**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://github.com/kubernetes/enhancements/issues
[kubernetes/kubernetes]: https://github.com/kubernetes/kubernetes
[kubernetes/website]: https://github.com/kubernetes/website

## Summary

When an instance is shut down (i.e. not running), the control plane does not make the right
assumptions to enable stateful workloads to fail over. This KEP introduces a flow to do so.

## Motivation

One of the fundamental guarantees of Kubernetes is pod safety[1], but in some specific situations it has been too conservative. When a node is shut down, the control plane does not distinguish between a kubelet failure and a node failure. This leads to a situation where it cannot make the right assumptions to preserve the availability of stateful workloads: volumes are not detached because the control plane cannot determine whether containers are still running.

The impact of this behaviour was not very visible because, when a node was shut down, some (if not most) cloud providers deleted the node object. This led the podgc controller to remove the orphaned pods with a 0 grace period[2]. As a consequence, the attach_detach controller received a watch notification for the pod deletion and removed that pod and its attachments from its desired state. Finally, when reconciling the actual state with the desired state, the reconciler triggered the detach.

This behaviour was very racy: while the node was being deleted, it could come back, register itself, list its pods and start the containers while volumes were being detached, leading to data corruption. To mitigate this, we explicitly set requirements on how cloud providers handle node objects[3], i.e. not deleting the node object unless the instance does not exist anymore.

This made the issue noticeable to users, and several complained about it. As a temporary mitigation we introduced a taint called node-shutdown, enabling cluster administrators to write automation against it, although most of the solutions at the time were racy.

Solving this issue also requires taking the existing feature set into account, most notably the alpha bootstrap checkpointing feature[4]. Since it is alpha, it is not enabled by default, which makes the kubelet use the usual flow (i.e. check in with the apiserver to get the latest state). Also, sig-cluster-lifecycle (its main consumer) decided not to pursue this path and to start its deprecation[5].

[1] [Pod safety design](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md)
[2] [GC orphaned pods](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podgc/gc_controller.go#L97)
[3] [Requirements on node objects](https://github.com/kubernetes/cloud-provider/blob/master/cloud.go#L167)
[4] [Bootstrap checkpointing](https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/kubeadm/0004-bootstrap-checkpointing.md)
[5] [Bootstrap checkpointing deprecation intent](https://github.com/kubernetes/enhancements/issues/1103)

### Goals

* Increase the availability of stateful workloads
* Automate self-healing for stateful workloads

### Non-Goals

* Introduce a new API to report node state
* Handle cases where the Kubelet itself is unresponsive

## Proposal

User stories:

* As a cluster administrator, I want my stateful workloads to fail over without any intervention when a node is shut down.

* As a developer, I would like to rely on a self-healing platform for my stateful workloads.

This KEP does not intend to introduce any new API; instead we plan on re-using the Lease API and a mechanism similar to leader election (if not the same).
As stated, this KEP is all about finding the right interlock between all the actors. The main ones are:

* kube-controller-manager
* Kubelets

#### kube-controller-manager

kube-controller-manager needs to provide a controller that:

* calls the cloud provider
* applies the node-shutdown taint
* executes the detach operation
* removes the taint

Given the current design of our controllers and their adherence to the single-responsibility pattern, we will introduce a new controller called gc_volume_controller under pkg/controller/gc. We will also make the attach_detach_controller skip shutdown nodes to avoid concurrent detach requests.

The gc_volume_controller watches node objects and, when notified, performs the following steps (each step describes the possible race condition that could happen at that point):

1. checks the node state via IsShutdownByProviderID(): this can race with the cloud provider (i.e. the instance may have come back; see 3.b)
2. applies a node-shutdown taint
3. tries to hold the node's heartbeat Lease
    3.a. fails: back off and retry; if max retries are hit -> abort
    3.b. success: hold the Lease for this node; if the node comes back at this moment it won't be able to acquire the Lease -> pods won't start
4. registers the node as being processed in order to renew the Lease periodically
5. deletes the pods with a 0 grace period
6. detaches the volumes
7. once 6. returns, untaints the node and stops renewing the Lease

To avoid being too reactive in step 2, we can back off and not try to acquire the lock immediately, to avoid evicting too fast.
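
For illustration, here is a minimal sketch of this flow, assuming a hypothetical `gcVolumeController` that is given a Kubernetes client, a shutdown check backed by the cloud provider, and a detach helper. The names, the taint key and effect, and the Lease handling details below are illustrative only, not the final implementation:

```go
package gcvolume

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/pointer"
)

// All names here are illustrative; the real controller would live under
// pkg/controller/gc and use the shared informer/workqueue machinery.
const (
	shutdownTaintKey   = "node-shutdown"        // assumed taint key from this KEP
	leaseNamespace     = "kube-node-lease"      // namespace holding node heartbeat Leases
	controllerIdentity = "gc-volume-controller" // hypothetical Lease holder identity
)

type gcVolumeController struct {
	client        kubernetes.Interface
	isShutdown    func(ctx context.Context, providerID string) (bool, error) // backed by the cloud provider
	detachVolumes func(ctx context.Context, nodeName string) error           // backed by the volume plugins
}

// processNode implements steps 1-7 above for a single node.
func (c *gcVolumeController) processNode(ctx context.Context, node *corev1.Node) error {
	// 1. Check the node state with the cloud provider.
	shutdown, err := c.isShutdown(ctx, node.Spec.ProviderID)
	if err != nil || !shutdown {
		return err
	}

	// 2. Apply the node-shutdown taint.
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{Key: shutdownTaintKey, Effect: corev1.TaintEffectNoSchedule})
	if _, err := c.client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// 3. Try to take over the node's heartbeat Lease so that a returning
	//    kubelet cannot acquire it and start pods while volumes are detached.
	lease, err := c.client.CoordinationV1().Leases(leaseNamespace).Get(ctx, node.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	now := metav1.NewMicroTime(time.Now())
	lease.Spec.HolderIdentity = pointer.String(controllerIdentity)
	lease.Spec.RenewTime = &now
	if _, err := c.client.CoordinationV1().Leases(leaseNamespace).Update(ctx, lease, metav1.UpdateOptions{}); err != nil {
		// 3.a: the caller backs off and retries, aborting after max retries.
		return fmt.Errorf("could not acquire lease for node %s: %w", node.Name, err)
	}
	// 4. (Not shown) register the node so a background loop keeps renewing the Lease.

	// 5. Delete the node's pods with a 0 grace period.
	pods, err := c.client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node.Name,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if err := c.client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: pointer.Int64(0),
		}); err != nil {
			return err
		}
	}

	// 6. Detach the volumes; 7. untainting and stopping Lease renewal are not shown.
	return c.detachVolumes(ctx, node.Name)
}
```

In the real controller the node state would come from a shared informer and retries would go through a workqueue rather than the inline error handling shown here.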

#### Kubelet

We also need to apply changes to the kubelet. At startup the kubelet will:

1. acquire the Lease:
    a. fails: do not start any pods, unless the pod:
        i. tolerates the node-shutdown taint, or
        ii. is a static pod
    b. success: start normally
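
A rough sketch of the kubelet-side check, assuming hypothetical helpers that run before pods are admitted at startup; the static-pod annotation and the taint key used below are assumptions:

```go
package kubeletstartup

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// canStartPods decides at kubelet startup whether regular pods may be
// started: the node's heartbeat Lease must not be held by another identity
// (e.g. the shutdown-handling controller).
func canStartPods(ctx context.Context, client kubernetes.Interface, nodeName string) bool {
	lease, err := client.CoordinationV1().Leases("kube-node-lease").Get(ctx, nodeName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return true // no Lease yet: nothing holds the lock, the kubelet will create it
	}
	if err != nil {
		return false // be conservative on other errors
	}
	holder := lease.Spec.HolderIdentity
	return holder == nil || *holder == nodeName
}

// mayStartAnyway covers the exceptions in 1.a: static pods and pods that
// tolerate the node-shutdown taint may start even without the Lease.
func mayStartAnyway(pod *corev1.Pod) bool {
	if pod.Annotations["kubernetes.io/config.source"] == "file" { // static pod (assumed annotation)
		return true
	}
	for _, t := range pod.Spec.Tolerations {
		if t.Key == "node-shutdown" { // assumed taint key from this KEP
			return true
		}
	}
	return false
}
```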

### Risks and Mitigations

This KEP introduces changes that, if not tested correctly, could result in data corruption for users.
To mitigate this we plan to have high test coverage and to introduce this enhancement as an alpha feature.

## Design Details

### Test Plan

This feature will be tested with the following approaches:

- unit tests
- integration tests
- e2e tests

We also plan to test this with different version skews.

### Graduation Criteria

This KEP will be treated as a new feature and will be introduced behind a new feature gate, `NodeShutdownFailover`.

This enhancement will go through the following maturity levels: alpha, beta and stable.

The graduation criteria between these levels are to be determined.
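
For reference, a minimal sketch of how the gate could be declared, following the common `k8s.io/component-base/featuregate` pattern; the gate name comes from this KEP, while the exact placement and defaults are assumptions:

```go
package features

import (
	"k8s.io/apimachinery/pkg/util/runtime"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

const (
	// NodeShutdownFailover gates the shutdown-taint / Lease based failover
	// flow described in this KEP. Alpha, therefore disabled by default.
	NodeShutdownFailover featuregate.Feature = "NodeShutdownFailover"
)

func init() {
	runtime.Must(utilfeature.DefaultMutableFeatureGate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		NodeShutdownFailover: {Default: false, PreRelease: featuregate.Alpha},
	}))
}
```

Code paths added by this KEP would then be guarded with `utilfeature.DefaultFeatureGate.Enabled(NodeShutdownFailover)`.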

### Upgrade / Downgrade Strategy

Upgrade and downgrade behaviour is controlled by the feature gate and should be covered by the test plan:

- to keep the old behavior, disable the feature gate
- make sure that the new behavior is not running against lower-version kubelets

### Version Skew Strategy

This feature requires version parity between the control plane and the kubelet; the new behavior shouldn't be enabled with older kubelets in the cluster.

## Implementation History

- 2019-06-26: Initial KEP published.

## Drawbacks

This design adds new traffic to the apiserver; we should make sure that this feature does not break our SLOs.

## Alternatives

In order to achieve failover properties, the control plane needs to coordinate with the kubelet.

### Rely on the taint as a lock mechanism (taint and untaint by the same actor)

Relying on the node-shutdown taint as a lock between the control plane and the kubelet can be compelling,
but if we do so there can still be a race condition:

1. the control plane detects that a node is shut down and applies the taint
2. it starts evicting the pods
3. the node comes back
4. the controller untaints the node before the kubelet is up
5. the kubelet comes up and fetches the node object without the node-shutdown taint
6. it starts the pods while volumes are being detached

### Rely on the taint as a lock mechanism (taint and untaint by different actors)

1. the node lifecycle controller taints the node objects when it observes that a node is shut down (but doesn't untaint them)
2. gc_volume_controller starts evicting the pods and detaching the volumes
3. gc_volume_controller untaints the node when the eviction is done

The kubelet will:

1. start and fetch the node object from the apiserver
    1.a. fails: doesn't start containers unless they belong to a static pod or a DaemonSet, or tolerate the taint
    1.b. success
2. check for the taint
    2.a. exists: same as 1.a
    2.b. doesn't exist: start normally

This solution requires doing things in multiple places; also, handling cases where the controller-manager restarts in the middle of operations
is challenging (e.g. how to untaint nodes after a crash). A sketch of the kubelet-side check in this alternative is shown below.
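
For comparison, a minimal sketch of the kubelet-side taint check in this alternative; the client plumbing and the taint key are assumptions:

```go
package kubeletstartup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeShutdownTainted fetches the Node object and reports whether the
// node-shutdown taint is present. If the fetch fails (1.a above), the kubelet
// behaves as if the taint were present and holds back regular pods.
func nodeShutdownTainted(ctx context.Context, client kubernetes.Interface, nodeName string) bool {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return true // cannot prove the taint is absent, so do not start containers
	}
	for _, t := range node.Spec.Taints {
		if t.Key == "node-shutdown" { // assumed taint key from this KEP
			return true
		}
	}
	return false
}
```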