Commit a424833

add node shutdown KEP

Signed-off-by: Yassine TIJANI <[email protected]>

1 parent 0e8c157 commit a424833

1 file changed: 237 additions, 0 deletions

---
title: Node shutdown taint
authors:
- "@yastij"
owning-sig: sig-node
participating-sigs:
- sig-storage
- sig-cloud-provider
- sig-scalability
approvers:
- "@smarterclayton"
- "@liggitt"
- "@yujuhong"
- "@saadali"
reviewers:
- "@jingxu"
- "@andrewsykim"
- "@derekwaynecarr"
editor: Yassine Tijani
creation-date: 2019-06-26
last-updated: 2019-06-26
status: implementable
---

# Node shutdown Taint

## Table of Contents

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Implementation Details](#implementation-details)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Rely on the taint as a lock mechanism (taint and untaint by the same actor)](#rely-on-the-taint-as-a-lock-mechanism-taint-and-untaint-by-the-same-actor)
- [Rely on the taint as a lock mechanism (taint and untaint by different actors)](#rely-on-the-taint-as-a-lock-mechanism-taint-and-untaint-by-different-actors)

## Release Signoff Checklist

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

**Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those
approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).

**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://github.com/kubernetes/enhancements/issues
[kubernetes/kubernetes]: https://github.com/kubernetes/kubernetes
[kubernetes/website]: https://github.com/kubernetes/website

## Summary

When an instance is shut down (i.e. not running), the control plane does not make the right
assumptions to let stateful workloads fail over. This KEP introduces a flow to do so.

## Motivation

One of the fundamental guarantees of Kubernetes is Pod safety[1], but in some specific situations it has been too conservative. When a node is shut down, the control plane does not distinguish between a kubelet failure and a node failure. As a result it cannot make the right assumptions to preserve the availability of stateful workloads: volumes are not detached because the control plane cannot determine whether containers are still running.

The impact of this behaviour was not visible at first because, when a node was shut down, some cloud providers (if not most) deleted the node object. This led the podgc controller to remove the orphaned pods with a 0 grace period[2]. As a consequence, the attach_detach controller received a watch notification for the pod deletion and removed the pod and its attachment from its desired state. Finally, when reconciling the actual state with the desired state, the reconciler triggered the detach.

This behaviour was very racy: while the node was being deleted, it could come back, register itself, list its pods, and start the containers while volumes were being detached, leading to data corruption. To mitigate this, we explicitly set requirements on how cloud providers handle node objects[3], i.e. the node object must not be deleted unless the instance no longer exists.

This made the problem noticeable to users, and several complained about it. As a temporary mitigation we introduced a taint called node-shutdown, enabling cluster administrators to write automation against it, although most of the solutions at the time were racy.

Solving this issue also requires taking the existing feature set into account, most notably the alpha bootstrap checkpointing[4]. Since it is alpha it is not enabled by default, so the kubelet uses the usual flow (i.e. checks in to get the latest state from the apiserver). Also, sig-cluster-lifecycle (its main consumer) decided not to pursue this path and to start its deprecation[5].

[1] [Pod safety design](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md)
[2] [GC orphaned pods](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podgc/gc_controller.go#L97)
[3] [Requirements on node objects](https://github.com/kubernetes/cloud-provider/blob/master/cloud.go#L167)
[4] [Bootstrap checkpointing](https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/kubeadm/0004-bootstrap-checkpointing.md)
[5] [Bootstrap checkpointing deprecation intent](https://github.com/kubernetes/enhancements/issues/1103)

### Goals

* Increase the availability of stateful workloads
* Automate self-healing for stateful workloads

### Non-Goals

* Introduce a new API to report node state
* Handle cases where the Kubelet itself is unresponsive

## Proposal

User stories:

* As a cluster administrator, I want my stateful workloads to fail over without any intervention when a node is shut down.

* As a developer, I would like to rely on a self-healing platform for my stateful workloads.

This KEP does not intend to introduce any new API; instead we plan on re-using the Lease API and a mechanism which is similar to leader election (if not the same). A sketch of this is shown below.
As stated, this KEP is all about finding the right inter-lock between all the actors. The main ones are:

* Kube-Controller-Manager
* Kubelets
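
To make the Lease mechanism concrete, here is a minimal sketch of how the controller side could take over a node's heartbeat Lease in the `kube-node-lease` namespace using the existing coordination.k8s.io Lease API. The holder identity, the expiry check, and the function itself are illustrative assumptions, not part of this proposal's API surface.

```go
package shutdowncontroller

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// controllerHolderIdentity is a hypothetical identity the controller would
// write into the Lease so that the kubelet can tell the node has been fenced.
const controllerHolderIdentity = "kube-controller-manager/gc-volume-controller"

// tryAcquireNodeLease attempts to become the holder of the node's heartbeat
// Lease (kube-node-lease/<nodeName>). It only succeeds if the kubelet has
// stopped renewing the Lease, i.e. the node is no longer heartbeating.
func tryAcquireNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName string) (bool, error) {
	leases := cs.CoordinationV1().Leases("kube-node-lease")
	lease, err := leases.Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// Refuse to take over a Lease that is still being renewed by someone else.
	if lease.Spec.RenewTime != nil && lease.Spec.LeaseDurationSeconds != nil {
		expiry := lease.Spec.RenewTime.Add(time.Duration(*lease.Spec.LeaseDurationSeconds) * time.Second)
		heldByUs := lease.Spec.HolderIdentity != nil && *lease.Spec.HolderIdentity == controllerHolderIdentity
		if time.Now().Before(expiry) && !heldByUs {
			return false, nil
		}
	}
	holder := controllerHolderIdentity
	lease.Spec.HolderIdentity = &holder
	lease.Spec.RenewTime = &metav1.MicroTime{Time: time.Now()}
	// An update conflict means the kubelet renewed concurrently; the caller
	// backs off and retries, as step 3.a of the controller flow describes.
	if _, err := leases.Update(ctx, lease, metav1.UpdateOptions{}); err != nil {
		return false, err
	}
	return true, nil
}
```

Re-using the kubelet's own heartbeat Lease means a returning kubelet and the controller contend on a single object, which is what provides the inter-lock between the two actors.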

#### kube-controller-manager

kube-controller-manager needs to provide a controller that:

* Calls the cloud provider
* Applies the node-shutdown taint
* Executes the detach operation
* Removes the taint

Given the current design of our controllers and their adherence to the single-responsibility pattern, we’ll introduce a new controller called gc_volume_controller under pkg/controller/gc. We will also make the attach_detach_controller skip shutdown nodes to avoid concurrent detach requests.

This gc_volume_controller watches node objects and, when notified (each step describes the possible race condition that could happen):

1. checks the node state via IsShutdownByProviderID(): if there’s a race condition with the cloud provider (i.e. the instance came back, see 2.a)
2. applies a node-shutdown taint
3. tries to hold the node heartbeat lease
   - 3.a. fails: backs off and retries; if max retries is hit -> aborts
   - 3.b. success: holds the lease for this node; if the node comes back at this moment it won’t be able to acquire the lease -> pods won’t start
4. registers the node as being processed in order to renew the Lease periodically
5. deletes the pods with a 0 grace period
6. detaches the volumes
7. once 6. returns, untaints the node and stops renewing the Lease

To avoid being too reactive in 2. we can back off and not try to acquire the lock immediately, to avoid evicting too fast. A sketch of this flow is shown below.
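
For illustration, a sketch of how steps 1–7 could fit together. The `gcVolumeController` struct and every helper hook below are hypothetical stand-ins for the operations described above, not the actual controller code.

```go
package shutdowncontroller

import (
	"context"
	"errors"

	corev1 "k8s.io/api/core/v1"
)

// errLeaseNotAcquired signals that the node is still heartbeating; the work
// item should be re-queued with backoff (step 3.a).
var errLeaseNotAcquired = errors.New("node heartbeat lease not acquired")

// gcVolumeController is a stand-in for the proposed controller; the
// function-typed fields are hypothetical hooks for the steps listed above.
type gcVolumeController struct {
	instanceShutdown    func(ctx context.Context, node *corev1.Node) (bool, error) // wraps IsShutdownByProviderID()
	applyShutdownTaint  func(ctx context.Context, node *corev1.Node) error
	removeShutdownTaint func(ctx context.Context, node *corev1.Node) error
	acquireLease        func(ctx context.Context, nodeName string) (bool, error) // e.g. tryAcquireNodeLease above
	startRenewingLease  func(ctx context.Context, nodeName string) (stop func())
	forceDeletePods     func(ctx context.Context, nodeName string) error
	detachVolumes       func(ctx context.Context, nodeName string) error
}

// handleShutdownNode walks steps 1-7 for a single node reported as shut down.
func (c *gcVolumeController) handleShutdownNode(ctx context.Context, node *corev1.Node) error {
	// 1. Re-check with the cloud provider; the instance may have come back
	//    between the watch notification and now.
	shutdown, err := c.instanceShutdown(ctx, node)
	if err != nil {
		return err
	}
	if !shutdown {
		return nil // the instance came back, nothing to do
	}
	// 2. Apply the node-shutdown taint.
	if err := c.applyShutdownTaint(ctx, node); err != nil {
		return err
	}
	// 3. Try to hold the node heartbeat Lease.
	held, err := c.acquireLease(ctx, node.Name)
	if err != nil {
		return err
	}
	if !held {
		return errLeaseNotAcquired // 3.a: back off, retry, abort after max retries
	}
	// 4. Keep renewing the Lease while the node is processed, so a returning
	//    kubelet cannot start pods in the meantime (3.b).
	stopRenewing := c.startRenewingLease(ctx, node.Name)
	defer stopRenewing()
	// 5. Delete the pods with a 0 grace period.
	if err := c.forceDeletePods(ctx, node.Name); err != nil {
		return err
	}
	// 6. Detach the volumes.
	if err := c.detachVolumes(ctx, node.Name); err != nil {
		return err
	}
	// 7. Only once detach has returned, untaint the node; the deferred stop
	//    then ends the Lease renewal.
	return c.removeShutdownTaint(ctx, node)
}
```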

#### Kubelet

We also need to apply changes to the kubelet. At startup the kubelet will:

1. acquire the Lease:
   - a. fails: do not start any pods, unless the pod:
     - i. tolerates the node shutdown taint
     - ii. is a static Pod
   - b. success: start normally

A sketch of this startup check is shown below.
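
A minimal sketch of the resulting admission check, assuming the existing `node.cloudprovider.kubernetes.io/shutdown` taint key and a simplified toleration match (both are assumptions for this example, not decisions made by this KEP):

```go
package kubeletstartup

import (
	corev1 "k8s.io/api/core/v1"
)

// taintNodeShutdown is the shutdown taint key applied by the cloud node
// lifecycle controller today; whether this KEP keeps that key is assumed here.
const taintNodeShutdown = "node.cloudprovider.kubernetes.io/shutdown"

// mayStartPod decides whether a pod may be started at kubelet startup.
// If the kubelet failed to acquire its node Lease (leaseHeld == false), only
// static pods and pods tolerating the shutdown taint are allowed to start.
func mayStartPod(pod *corev1.Pod, isStaticPod, leaseHeld bool) bool {
	if leaseHeld {
		return true // 1.b: Lease acquired, start normally
	}
	if isStaticPod {
		return true // 1.a.ii
	}
	for _, tol := range pod.Spec.Tolerations {
		// Simplified matching: a real implementation would also honour the
		// toleration operator and effect.
		if tol.Key == taintNodeShutdown || (tol.Key == "" && tol.Operator == corev1.TolerationOpExists) {
			return true // 1.a.i
		}
	}
	return false
}
```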

### Risks and Mitigations

This KEP introduces changes that, if not tested correctly, could result in data corruption for users.
To mitigate this we plan to have high test coverage and to introduce this enhancement as an alpha feature.

## Design Details

### Test Plan

This feature will be tested with the following approaches:

- unit tests
- integration tests
- e2e tests

We also plan to test this with different version skews.

### Graduation Criteria

This KEP will be treated as a new feature, and will be introduced with a new feature gate, NodeShutdownFailover.

This enhancement will go through the following maturity levels: alpha, beta and stable.

Graduation criteria between these levels are to be determined. A sketch of the feature gate registration is shown below.
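
For illustration only, a sketch of how such a gate might be registered with the usual `k8s.io/component-base/featuregate` machinery; the default value and package location are assumptions to be settled during implementation.

```go
package features

import (
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

const (
	// NodeShutdownFailover gates the shutdown taint / Lease fencing flow
	// described in this KEP.
	NodeShutdownFailover featuregate.Feature = "NodeShutdownFailover"
)

func init() {
	// Alpha and disabled by default, matching the graduation plan above.
	utilruntime.Must(utilfeature.DefaultMutableFeatureGate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		NodeShutdownFailover: {Default: false, PreRelease: featuregate.Alpha},
	}))
}

// Callers would then guard the new behavior with:
//   if utilfeature.DefaultFeatureGate.Enabled(features.NodeShutdownFailover) { ... }
```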

### Upgrade / Downgrade Strategy

Consider the following in developing an upgrade/downgrade strategy for this enhancement:

- to keep the old behavior: disable the feature gate
- make sure that the new behavior is not running against lower-version Kubelets

### Version Skew Strategy

This feature requires version parity between the control plane and the kubelet; this behavior shouldn't be enabled with older Kubelets.

## Implementation History

- 2019-06-26: Initial KEP published.

## Drawbacks

This design adds new traffic to the apiserver; we should make sure that this feature does not break our SLOs.

## Alternatives

In order to achieve failover properties, the control plane needs to coordinate with the Kubelet.

### Rely on the taint as a lock mechanism (taint and untaint by the same actor)

Relying on the node shutdown taint as a lock between the control plane and the kubelet can be compelling.
But if we do so there can still be a race condition:

1. the control plane detects a node is shut down and applies the Taint
2. it starts evicting the pods
3. the node comes back
4. the controller untaints the node before the kubelet is up
5. the kubelet comes up and fetches the node object without the node shutdown Taint
6. it starts the pods while volumes are being detached

### Rely on the taint as a lock mechanism (taint and untaint by different actors)

1. the node lifecycle controller taints the node objects when it observes a node is shut down (but doesn't untaint them)
2. gc_volume_controller starts evicting the pods and detaching volumes
3. gc_volume_controller untaints the node when the eviction is done

The kubelet will:

1. start and fetch the node object from the apiserver
   - 1.a. fails: doesn't start containers unless the pod is a static pod, belongs to a DaemonSet, or tolerates the taint
   - 1.b. success
2. check for the taint
   - 2.a. exists: same as 1.a
   - 2.b. doesn't exist: start normally

This solution requires doing things in multiple places; also, handling cases where the controller-manager restarts in the middle of operations
is challenging (e.g. how to untaint nodes after a crash).
