keps/sig-node/1967-size-memory-backed-volumes/README.md
70 additions & 40 deletions
@@ -35,10 +35,14 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
35
35
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
36
36
- [x] (R) KEP approvers have approved the KEP status as `implementable`
37
37
- [x] (R) Design details are appropriately documented
38
-
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
38
+
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
39
+
- [x] e2e Tests for all Beta API Operations (endpoints)
40
+
- [x] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
41
+
- [x] (R) Minimum Two Week Window for GA e2e tests to prove flake free
39
42
- [x] (R) Graduation criteria is in place
43
+
- [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
40
44
- [x] (R) Production readiness review completed
41
-
- [x] Production readiness review approved
45
+
- [x] (R) Production readiness review approved
42
46
- [x] "Implementation History" section is up-to-date for milestone
43
47
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
44
48
- [x] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
@@ -141,115 +145,141 @@ potentially inaccurate volume size based on node configuration.
141
145
142
146
### Feature Enablement and Rollback
143
147
144
-
_This section must be completed when targeting alpha to a release._
148
+
#### How can this feature be enabled / disabled in a live cluster?
145
149
146
-
* **How can this feature be enabled / disabled in a live cluster?**
147
-
- [x] Feature gate (also fill in values in `kep.yaml`)
148
-
- Feature gate name: SizeMemoryBackedVolumes
149
-
- Components depending on the feature gate: kubelet
150
-
- Will enabling / disabling the feature require downtime or reprovisioning
151
-
of a node? No
150
+
- [x] Feature gate (also fill in values in `kep.yaml`)
151
+
- Feature gate name: SizeMemoryBackedVolumes
152
+
- Components depending on the feature gate: kubelet
153
+
- Will enabling / disabling the feature require downtime or reprovisioning
154
+
of a node? No
155
+
156
+
#### Does enabling the feature change any default behavior?
152
157
153
-
* **Does enabling the feature change any default behavior?**
154
158
Yes, the kubelet will size the empty dir volume to match the precise
155
159
amount of memory the pod is able to write rather than over or undersizing.
156
160
Prior behavior is node dependent, and so pod authors had no mechanism
157
161
to control this behavior properly.
158
162
159
-
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
160
-
the enablement)?** Yes
163
+
#### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
164
+
165
+
Yes
166
+
167
+
#### What happens if we reenable the feature if it was previously rolled back?
161
168
162
-
* **What happens if we reenable the feature if it was previously rolled back?**
163
169
Pods that run on that node will have memory backed volumes sized based on Linux
164
170
host default. The sizing may not align with actual available memory for an app.
165
171
166
-
* **Are there any tests for feature enablement/disablement?**
172
+
#### Are there any tests for feature enablement/disablement?
173
+
167
174
No, testing behavior with the feature disabled is dependent on node operating
168
175
system configuration. The point of this KEP is to address that coupling.
169
176
170
177
### Rollout, Upgrade and Rollback Planning
171
178
172
-
* **How can a rollout fail? Can it impact already running workloads?**
179
+
#### How can a rollout fail? Can it impact already running workloads?
180
+
173
181
If a pod has more allocatable memory than the default node instance behavior
174
182
of taking 50% node instance memory for sizing emptyDir, a pod could potentially
175
183
write more content to the empty dir volume than previously. This should have
176
184
no impact on rollout of the cluster or workload. In practice, applications
177
185
that did exhaust the size of the memory backed volume were not portable across
178
186
instance types or would have had to handle running out of room in that volume.
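The sizing difference described above can be illustrated with a small sketch. It is not the kubelet implementation: the function names are hypothetical, and the rule "take the smallest applicable bound" is an assumption used here only to show how a pod-based size can exceed or undercut the 50% node default.

```python
from typing import Optional

GIB = 1024 ** 3


def node_default_size(node_memory_bytes: int) -> int:
    """Prior behavior per the text above: 50% of node instance memory."""
    return node_memory_bytes // 2


def pod_based_size(pod_memory_bytes: int,
                   node_allocatable_bytes: int,
                   size_limit_bytes: Optional[int] = None) -> int:
    """Hypothetical sizing rule (an assumption, not the kubelet's code):
    the volume is capped by the smallest bound that applies to the pod."""
    bounds = [pod_memory_bytes, node_allocatable_bytes]
    if size_limit_bytes is not None:
        bounds.append(size_limit_bytes)
    return min(bounds)


# A 16 GiB node: the old default gives every memory-backed volume 8 GiB.
before = node_default_size(16 * GIB)

# A pod that may use 12 GiB could now write more than before ...
larger = pod_based_size(12 * GIB, node_allocatable_bytes=15 * GIB)

# ... while a pod with a 1 GiB sizeLimit is capped at 1 GiB.
smaller = pod_based_size(12 * GIB, node_allocatable_bytes=15 * GIB,
                         size_limit_bytes=1 * GIB)
```

In this illustration the first pod's volume grows from 8 GiB to 12 GiB, which is the "could write more content than previously" case the paragraph mentions.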
179
187
180
-
* **What specific metrics should inform a rollback?**
188
+
#### What specific metrics should inform a rollback?
189
+
181
190
None.
182
191
183
-
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
192
+
#### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
193
+
184
194
I do not believe this is applicable.
185
195
186
-
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
187
-
fields of API types, flags, etc.?**
188
-
Even if applying deprecation policies, they may still surprise some users.
196
+
#### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
197
+
189
198
No.
190
199
191
200
### Monitoring Requirements
192
201
193
-
* **How can an operator determine if the feature is in use by workloads?**
202
+
#### How can an operator determine if the feature is in use by workloads?
203
+
194
204
An operator can audit for pods whose emptyDir medium is memory and a size limit
195
205
is specified. It's not clear there is a benefit to track this because it only
196
206
impacts how the kubelet better enforces an existing API.
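One way to run such an audit is sketched below. It assumes the operator has already exported the pod list as JSON (for example with `kubectl get pods --all-namespaces -o json`) and parsed it into a dictionary; the function name is hypothetical, while `emptyDir.medium` and `emptyDir.sizeLimit` are the existing pod API fields.

```python
from typing import Any, Dict, List, Tuple


def memory_volumes_with_size_limit(
        pod_list: Dict[str, Any]) -> List[Tuple[str, str, str, str]]:
    """Return (namespace, pod, volume, sizeLimit) for every emptyDir volume
    whose medium is Memory and that specifies a size limit."""
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for vol in pod.get("spec", {}).get("volumes", []):
            empty_dir = vol.get("emptyDir")
            if not empty_dir:
                # Not an emptyDir volume, or one that uses all defaults.
                continue
            if empty_dir.get("medium") == "Memory" and empty_dir.get("sizeLimit"):
                found.append((meta.get("namespace", ""),
                              meta.get("name", ""),
                              vol.get("name", ""),
                              empty_dir["sizeLimit"]))
    return found


# Illustrative input only; names and sizes are made up.
sample = {
    "items": [
        {"metadata": {"namespace": "default", "name": "cache"},
         "spec": {"volumes": [
             {"name": "scratch",
              "emptyDir": {"medium": "Memory", "sizeLimit": "1Gi"}},
             {"name": "disk", "emptyDir": {}},
         ]}},
        {"metadata": {"namespace": "default", "name": "web"},
         "spec": {"volumes": [
             {"name": "tmp", "emptyDir": {"medium": "Memory"}},
         ]}},
    ]
}

matches = memory_volumes_with_size_limit(sample)
```

Only the `scratch` volume matches here, because the other volumes either use the default medium or set no size limit.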
197
207
198
-
* **What are the SLIs (Service Level Indicators) an operator can use to determine
199
-
the health of the service?**
208
+
#### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
209
+
200
210
This does not seem relevant to this feature.
201
211
202
-
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
212
+
#### How can someone using this feature know that it is working for their instance?
213
+
214
+
- [x] Other
215
+
- Details: An operator can audit for pods whose emptyDir medium is memory and a size limit
216
+
is specified.
217
+
218
+
#### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
219
+
203
220
This does not seem relevant to this feature.
204
221
205
-
* **Are there any missing metrics that would be useful to have to improve observability
206
-
of this feature?**
222
+
#### Are there any missing metrics that would be useful to have to improve observability of this feature?
223
+
207
224
No.
208
225
209
226
### Dependencies
210
227
211
-
* **Does this feature depend on any specific services running in the cluster?**
228
+
#### Does this feature depend on any specific services running in the cluster?
229
+
212
230
No
213
231
214
232
### Scalability
215
233
216
-
* **Will enabling / using this feature result in any new API calls?**
234
+
#### Will enabling / using this feature result in any new API calls?
235
+
217
236
No.
218
237
219
-
* **Will enabling / using this feature result in introducing new API types?**
238
+
#### Will enabling / using this feature result in introducing new API types?
239
+
240
+
No
241
+
242
+
#### Will enabling / using this feature result in any new calls to the cloud provider?
245
+
220
246
No
221
247
222
-
* **Will enabling / using this feature result in any new calls to the cloud
223
-
provider?**
248
+
#### Will enabling / using this feature result in increasing size or count of the existing API objects?
249
+
224
250
No
225
251
226
-
* **Will enabling / using this feature result in increasing size or count of
227
-
the existing API objects?**
252
+
#### Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?
253
+
228
254
No
229
255
230
-
* **Will enabling / using this feature result in increasing time taken by any
231
-
operations covered by [existing SLIs/SLOs]?**
256
+
#### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
257
+
232
258
No
233
259
234
-
* **Will enabling / using this feature result in non-negligible increase of
235
-
resource usage (CPU, RAM, disk, IO, ...) in any components?**
260
+
#### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
261
+
236
262
No
237
263
238
264
### Troubleshooting
239
265
240
-
* **How does this feature react if the API server and/or etcd is unavailable?**
266
+
#### How does this feature react if the API server and/or etcd is unavailable?
267
+
241
268
No impact.
242
269
243
-
* **What are other known failure modes?**
270
+
#### What are other known failure modes?
271
+
244
272
Not applicable.
245
273
246
-
* **What steps should be taken if SLOs are not being met to determine the problem?**
274
+
#### What steps should be taken if SLOs are not being met to determine the problem?