---
title: Cost based scaling down of pods
authors:
  - "@ingvagabund"
owning-sig: sig-scheduling
participating-sigs:
  - sig-apps
reviewers:
  - TBD
approvers:
  - TBD
editor: TBD
creation-date: 2020-06-30
last-updated: 2020-07-20
status: provisional
---

# Cost based scaling down of pods

## Table of Contents

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Examples of strategies](#examples-of-strategies)
    - [Balancing duplicates among topological domains](#balancing-duplicates-among-topological-domains)
    - [Pods not tolerating taints first](#pods-not-tolerating-taints-first)
    - [Minimizing pod anti-affinity](#minimizing-pod-anti-affinity)
  - [Rank normalization and weighted sum](#rank-normalization-and-weighted-sum)
  - [User Stories [optional]](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    - [Phases](#phases)
  - [Option A (field in a pod status)](#option-a-field-in-a-pod-status)
  - [Option B (CRD for a pod group)](#option-b-crd-for-a-pod-group)
  - [Workflow example](#workflow-example)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)
- [Alternatives [optional]](#alternatives-optional)
<!-- /toc -->

## Summary

Ranking pods through an external component and scaling down pods based on their
cost makes it possible to employ various scheduling strategies that keep a cluster
from diverging from an optimal distribution of resources.
Providing an external solution for selecting the right victim improves the ability
to preserve various conditions, such as balancing pods among failure domains, staying
aligned with security requirements, or respecting application policies.
Keeping controllers free of any scheduling strategy, while still making them aware
of the impact of removing pods on the overall cluster scheduling plan, helps to reduce
the cost of re-scheduling resources.

## Motivation

Scaling down a set of pods does not always result in an optimal selection of victims.
The scheduler relies on filters and scores which may distribute the pods with respect to topology
spreading and/or load-balancing constraints (e.g. pods uniformly balanced among zones).
Application-specific workloads may prefer to scale down short-running pods and favor long-running pods.
Selecting a victim with trivial logic can unbalance the topology spreading
or cause jobs to lose the work they have accumulated.
Given that it is a natural property of a cluster to shift workloads over time,
a decision made by a scheduler at some point is only as good as its ability to predict future demands.
The default Kubernetes scheduler was designed to provide high throughput
at the cost of being simple. Thus, it is quite easy to diverge from the scheduling plan.
The descheduler, in contrast, helps to re-balance the plan and move the cluster closer to
the scheduler constraints. Yet, it is designed to run and adjust the cluster periodically (e.g. every hour)
and is therefore unusable for scaling down, which requires immediate action.

On the other hand, each controller with a scale-down operation has its own
implementation of the victim selection logic.
This decision-making logic does not take a scheduling plan into account.
Extending each such controller with additional logic to support various scheduling
constraints is impractical, and in cases where a proprietary solution for scaling down is required,
impossible. Also, controllers do not necessarily have a whole-cluster overview,
so their decisions are not guaranteed to be optimal.
Therefore, it is more feasible to locate the logic outside of a controller.

In order to support a more informed scale-down operation while keeping the scheduling plan in mind,
additional decision logic that can be extended based on application requirements is needed.

### Goals

- Controllers with a scale-down operation are able to select a victim while still respecting a scheduling plan
- An external component is available that can rank pods based on how much the cluster diverges from the scheduling plan when they are deleted

### Non-Goals

- Employing strategies that require cost re-computation after scaling up/down (with support from controllers, e.g. backing off)

## Proposal

The proposed solution is to implement an optional cost-based component that watches all
pods (or a subset of them) and nodes (potentially other objects) present in a cluster,
assigning each pod a cost based on a set of scheduling constraints.
At the same time, controller logic is extended to utilize the pod cost when selecting a victim during a scale-down operation.

The component will allow a different list of scheduling constraints to be selected for each targeted
set of pods. Each pod in a set will be given a cost based on how important it is in the set.
The constraints can follow the same rules as the scheduler (by importing scheduling plugins)
or be custom made (e.g. wrt. application or proprietary requirements).
The component will implement a mechanism for ranking pods.
<!-- Either by annotating a pod, updating its status, setting a new field in pod's spec
or creating a new CRD which will carry a cost. -->
Each controller will have a choice to either ignore the cost or take it into account
when scaling down.

This way, the logic for selecting a victim for the scale-down operation is
separated from each controller, allowing each consumer to provide its own
logic for assigning costs while all controllers consume the cost uniformly.

Given that the default scheduler is not a source of truth about how a pod should be distributed
after it was scheduled, scaling-down strategies can exercise completely different approaches.
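
As a rough illustration of the mechanism, the sketch below shows one possible shape for
the ranking component's strategy abstraction. The `Strategy` and `PodInfo` names are purely
illustrative and are not part of this proposal:

```go
package ranking

// Strategy ranks a group of pods according to one scheduling constraint.
// The higher the cost, the better a candidate the pod is for removal.
// The interface and names are illustrative only.
type Strategy interface {
	// Name identifies the strategy, e.g. "topology-domain-balance".
	Name() string
	// Rank returns a cost for every pod in the group, keyed by pod name.
	Rank(pods []PodInfo) map[string]int
}

// PodInfo carries the subset of pod data a strategy needs.
type PodInfo struct {
	Name              string
	Node              string
	TopologyDomain    string
	CreationTimestamp int64 // Unix seconds
}
```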

Examples of scheduling constraints:
- choose pods running on a node which has a `PreferNoSchedule` taint first
- choose the youngest/oldest pods first
- choose pods that minimize the topology skew among failure domains (e.g. availability zones)

The goal of the proposal is not to provide specific strategies for a more informed scale-down operation.
The primary goal is to provide a mechanism and have controllers implement it,
allowing consumers of the new component to define their own strategies.

### Examples of strategies

Strategies can be divided into two categories:
- scaling a pod group down/up does not require rank re-computation
- scaling a pod group down/up requires rank re-computation

#### Balancing duplicates among topological domains

- Evict pods while minimizing the skew between topological domains
- Each pod can be given a cost based on how old/young it is within its domain (see the sketch below):
  - if a pod is the first one in the domain, rank the pod with cost `1`
  - if a pod was created second in the domain, rank the pod with cost `2`
  - continue this way until all pods in all domains are ranked
- the higher the rank of a pod, the sooner the pod gets removed
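
A minimal sketch of such a ranking, assuming pods are reduced to a hypothetical `PodInfo`
struct carrying only a name, a topology domain and a creation timestamp:

```go
package main

import (
	"fmt"
	"sort"
)

type PodInfo struct {
	Name      string
	Domain    string // e.g. availability zone
	CreatedAt int64  // Unix seconds
}

// rankByDomainAge gives the oldest pod in each domain cost 1, the second
// oldest cost 2, and so on. The newest pods in each domain get the highest
// cost and are removed first, which keeps the skew minimal.
func rankByDomainAge(pods []PodInfo) map[string]int {
	byDomain := map[string][]PodInfo{}
	for _, p := range pods {
		byDomain[p.Domain] = append(byDomain[p.Domain], p)
	}
	costs := map[string]int{}
	for _, group := range byDomain {
		sort.Slice(group, func(i, j int) bool { return group[i].CreatedAt < group[j].CreatedAt })
		for i, p := range group {
			costs[p.Name] = i + 1
		}
	}
	return costs
}

func main() {
	pods := []PodInfo{
		{"p1", "az1", 10}, {"p2", "az1", 20},
		{"p3", "az2", 15}, {"p4", "az2", 25}, {"p5", "az2", 30},
	}
	fmt.Println(rankByDomainAge(pods)) // p5 (cost 3) would be evicted first
}
```
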
#### Pods not tolerating taints first

- Evict pods that do not tolerate taints before pods that tolerate taints
- Each pod can be given a cost based on how many taints it does not tolerate (see the sketch below)
- the higher the rank of a pod, the sooner the pod gets removed
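
A minimal sketch of this cost computation, using simplified stand-ins for the core/v1
`Taint` and `Toleration` types and exact-match toleration semantics (the real matching
rules, e.g. the `Exists` operator or empty effects, are richer):

```go
package main

import "fmt"

// Simplified stand-ins for the core/v1 Taint and Toleration types.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Value, Effect string }

// tolerates reports whether a toleration exactly matches a taint.
func tolerates(tol Toleration, t Taint) bool {
	return tol.Key == t.Key && tol.Value == t.Value && tol.Effect == t.Effect
}

// costByUntoleratedTaints counts how many of the node's taints the pod does
// not tolerate. The more untolerated taints, the higher the cost and the
// sooner the pod gets removed.
func costByUntoleratedTaints(nodeTaints []Taint, podTolerations []Toleration) int {
	cost := 0
	for _, taint := range nodeTaints {
		tolerated := false
		for _, tol := range podTolerations {
			if tolerates(tol, taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			cost++
		}
	}
	return cost
}

func main() {
	taints := []Taint{{"dedicated", "infra", "PreferNoSchedule"}}
	fmt.Println(costByUntoleratedTaints(taints, nil)) // 1: prefer this pod as a victim
}
```
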
#### Minimizing pod anti-affinity

- Evict pods maximizing anti-affinity first
- A pod that improves anti-affinity on a node gets a higher rank
- Given that multiple pod groups can be part of an anti-affinity group, scaling down
  a single pod in a group requires re-computation of the ranks of all participating pods.
  Also, only a single pod can be scaled down at a time.
  Otherwise, the ranks may no longer yield an optimal victim selection.

Of the provided examples, the first two strategies do not require rank re-computation.

### Rank normalization and weighted sum

In order to allow pod ranking by multiple strategies/constraints, it is important
to normalize ranks. On the other hand, rank normalization requires all strategies
to re-compute all ranks every time a pod is created/deleted. To eliminate the need
to re-compute, each strategy can introduce a threshold where every pod rank
exceeding the threshold gets rounded down to the threshold.
E.g. if a topology domain has at least 10 pods, the 11th and later pods get the same
rank as the 10th pod.
With threshold-based normalization, multiple strategies can rank a pod group,
and the results can be combined into a weighted rank across all relevant strategies.
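
A minimal sketch of threshold-based normalization and the weighted sum, with made-up
strategy names, thresholds and weights:

```go
package main

import "fmt"

// normalize caps a raw rank at threshold and scales it into [0, 1], so ranks
// from different strategies become comparable without re-computing the whole
// group every time a pod is added or removed.
func normalize(rank, threshold int) float64 {
	if rank > threshold {
		rank = threshold
	}
	return float64(rank) / float64(threshold)
}

// weightedCost combines normalized ranks from several strategies into a single
// pod cost using per-strategy weights.
func weightedCost(ranks, thresholds map[string]int, weights map[string]float64) float64 {
	cost := 0.0
	for strategy, rank := range ranks {
		cost += weights[strategy] * normalize(rank, thresholds[strategy])
	}
	return cost
}

func main() {
	// The 11th pod in a domain gets the same normalized rank as the 10th one.
	fmt.Println(normalize(11, 10))
	fmt.Println(weightedCost(
		map[string]int{"domain-balance": 4, "untolerated-taints": 1},
		map[string]int{"domain-balance": 10, "untolerated-taints": 5},
		map[string]float64{"domain-balance": 0.7, "untolerated-taints": 0.3},
	))
}
```
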
### User Stories [optional]

#### Story 1

From [@pnovotnak](https://github.com/kubernetes/kubernetes/issues/4301#issuecomment-328685358):

```
I have a number of scientific programs that I've wrapped with code to talk
to a message broker that do not checkpoint state. The cost of deleting the resource
increases over time (some of these tasks take hours), until it completes the current unit of work.

Choosing a pod by most idle resources would also work in my case.
```

#### Story 2

From [@cpwood](https://github.com/kubernetes/kubernetes/issues/4301#issuecomment-436587548):

```
For my use case, I'd prefer Kubernetes to choose its victims from pods that are running on nodes which have a PreferNoSchedule taint.
```

#### Story 3

From [@barucoh](https://github.com/kubernetes/kubernetes/issues/89922):

```
A deployment with 3 replicas with anti-affinity rule to spread across 2 AZs scaled down to 2 replicas in only 1 AZ.
```

### Implementation Details/Notes/Constraints

Currently, the descheduler does not allow immediate reaction to changes in a cluster.
Yet, with some modification, another instance of the descheduler (with a different set of strategies)
might be run in watch mode and rank each pod as it comes.
Also, once the scheduling framework gets migrated into its own repository,
scheduling plugins can be vendored as well to provide some of the core scheduling logic.

The pod ranking is best-effort, so in case a controller is to delete more than one pod,
it selects all the pods with the highest cost and removes those.
In case a pod fails to be deleted during the scale-down operation and the operation resumes in the next cycle,
it may happen that pods get ranked differently and a different set of victim pods gets selected.

Once a pod is removed, the ranks of other pods might need to be re-computed,
unless only strategies that do not require re-computation are deployed.
By default, all pods owned by a controller template have to be ranked.
Otherwise, a controller falls back to its original victim selection logic;
alternatively, it can be configured to wait or back off.
Also, the ranking strategies can be configured to target only selected sets of pods,
allowing a controller to employ cost-based selection only when more sophisticated
logic is required and available.

During the alpha phase, each controller utilizing the pod ranking will feature gate the new logic,
starting by utilizing a pod annotation (e.g. `scheduling.alpha.kubernetes.io/cost`)
which can eventually be promoted to either a dedicated pod field or moved under a CRD (see below).
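
A minimal sketch of how a controller could consume the annotation, including the best-effort
selection of multiple victims described above. The `pickVictims` helper is hypothetical;
only the annotation key comes from this proposal:

```go
package controller

import (
	"sort"
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// costAnnotation is the alpha annotation proposed above.
const costAnnotation = "scheduling.alpha.kubernetes.io/cost"

// pickVictims returns the n pods with the highest cost. It returns false when
// any pod is missing the annotation, in which case the caller is expected to
// fall back to its original victim selection logic (or back off and retry).
func pickVictims(pods []corev1.Pod, n int) ([]corev1.Pod, bool) {
	costs := make(map[string]int, len(pods))
	for _, p := range pods {
		v, ok := p.Annotations[costAnnotation]
		if !ok {
			return nil, false
		}
		c, err := strconv.Atoi(v)
		if err != nil {
			return nil, false
		}
		costs[p.Name] = c
	}
	sorted := append([]corev1.Pod(nil), pods...)
	sort.Slice(sorted, func(i, j int) bool { return costs[sorted[i].Name] > costs[sorted[j].Name] })
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n], true
}
```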

If strategies requiring rank re-computation are employed, it is more practical to define
a CRD for a pod group and keep all the costs in a single place to avoid desynchronization
of ranks among pods.

#### Phases

Phase 1:
- add support for strategies which do not need rank re-computation of a pod group
- only a single strategy can be run to rank pods (unless threshold-based normalization is applied)
- use annotations to hold a single pod cost

Phase 2A:
- promote the pod cost annotation to a pod status field
- no synchronization of pods in a pod group, which makes strategies requiring rank re-computation harder to support

Phase 2B:
- use a CRD to hold the costs of all pods in a pod group (to synchronize re-computation of ranks)
- add support for strategies which require rank re-computation

### Option A (field in a pod status)

Store a pod cost/rank under the pod's status so it can be updated only by a component
which has permission to update the pod status.

```go
// PodStatus represents information about the status of a pod. Status may trail the actual
// state of a system, especially if the node that hosts the pod cannot contact the control
// plane.
type PodStatus struct {
	...
	// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-cost
	// +optional
	Cost int `json:"cost,omitempty" protobuf:"bytes,...,opt,name=cost"`
	...
}
```

Very simple: the cost is read directly from the pod status.
No additional apimachinery logic is needed.
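
Assuming the new field existed, victim selection reduces to a sort. The sketch below uses a
simplified local type, since the `Cost` field is only proposed and does not exist in core/v1 today:

```go
package main

import (
	"fmt"
	"sort"
)

// podWithCost is a simplified stand-in for a pod whose status carries the
// proposed Cost field from Option A.
type podWithCost struct {
	Name string
	Cost int // would live under pod.Status.Cost
}

// sortByCostDescending orders pods so that the highest-cost pod (the preferred
// victim) comes first.
func sortByCostDescending(pods []podWithCost) {
	sort.Slice(pods, func(i, j int) bool { return pods[i].Cost > pods[j].Cost })
}

func main() {
	pods := []podWithCost{{"p1", 2}, {"p2", 8}, {"p3", 4}}
	sortByCostDescending(pods)
	fmt.Println(pods[0].Name) // p2 is scaled down first
}
```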
### Option B (CRD for a pod group)

```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroupCost
metadata:
  name: rc-guestbook-fronted
  namespace: rc-guestbook-fronted-namespace
spec:
  owner:
    kind: ReplicationController
    name: rc-guestbook-fronted # may be redundant
  costs:
    "rc-guestbook-fronted-pod1": 4
    "rc-guestbook-fronted-pod2": 8
    ...
    "rc-guestbook-fronted-podn": 2
```

More suitable for keeping all pod costs from a pod group in sync.
Controllers will need to take the new CRD into account (adding informers).
A CR will live in the same namespace as the underlying pod group (RC, etc.).
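
A possible Go shape for the CRD sketched above, following the usual apimachinery conventions;
the type and field names mirror the YAML example and are not final:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodGroupCost holds the costs of all pods belonging to one pod group so that
// re-computed ranks can be read (and updated) atomically in a single object.
type PodGroupCost struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec PodGroupCostSpec `json:"spec"`
}

// PodGroupCostSpec mirrors the YAML example above.
type PodGroupCostSpec struct {
	// Owner identifies the controller owning the pod group; it may be
	// redundant with the object's owner references.
	Owner GroupOwner `json:"owner"`
	// Costs maps pod names to their current cost.
	Costs map[string]int `json:"costs"`
}

// GroupOwner is a minimal owner descriptor used by the example.
type GroupOwner struct {
	Kind string `json:"kind"`
	Name string `json:"name"`
}
```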
### Workflow example

**Scenario**: a pod group of 12 pods, 3 AZs (2 nodes per AZ), pods evenly spread among all zones

1. Assume the pod group is supposed to respect topology spreading and the scale-down
   operation is to minimize the topology skew between domains.
1. **Ranking component**: The component is configured to rank pods based on their presence in a topology domain
1. **Ranking component**: The component notices the pods, analyzes the pod group and ranks the pods in the following manner (`{PodName: Rank}`):
   - AZ1: {P1: 1, P2: 2, P3: 3, P4: 4} (P1 getting 1 as it was created first in the domain, P2 getting 2, etc.)
   - AZ2: {P5: 1, P6: 2, P7: 3, P8: 4}
   - AZ3: {P9: 1, P10: 2, P11: 3, P12: 4}
1. **A controller**: A scale-down operation of the pod group is requested
1. **A controller**: The scale-down logic of the controller selects one of P4, P8 or P12 as a victim (e.g. P8)
1. The topology skew is now `1`
1. **Ranking component**: No need to re-compute ranks since the ranking does not depend on the pod group size
1. **A controller**: Scaling down one more time selects one of {P4, P12}
1. The topology skew is still `1`

### Risks and Mitigations

It may happen that the ranking component does not rank all relevant pods in time.
In that case a controller can either choose to ignore the cost, or it can back off
with a configurable timeout and retry the scale-down operation once all pods in
a given set are ranked.

From the security perspective, malicious code might assign a pod a different cost
with the goal of removing more vital pods and harming a running application.
How safe is using an annotation? It might be better to use the pod status,
so that only clients with pod/status update RBAC are allowed to change the cost.

In case a strategy needs to re-compute costs after a scale-down operation and
the component stops working (for any reason), a controller might scale down
the incorrect pod(s) in the next request. This is one more reason to constrain strategies
to those that do not need to re-compute pod costs.

In case a scale-down process is too quick, the component may be too slow to
recompute all scores and may provide suboptimal/incorrect costs.

In case two or more controllers own a pod group (through labels), scaling down the group by one
controller can result in scaling up the same group by another controller,
entering an endless loop of scaling up and down. This may result in unexpected
behavior and leave a subgroup of pods unranked.

Deployment upgrades might have different expectations when exercising a rolling update.
They could just completely ignore the costs, unless it is acceptable to scale down by one
and wait until the costs are recomputed when needed.

## Design Details
### Test Plan

**Scaling down respects pod ranking**:
In the simplest case the component ranks pods in a group.
The goal is to validate that all pods are scaled down in an order
respecting the ranks of all pods in the pod group.

**A controller ignores ranks if at least one pod is missing a rank**:
Testing the case where not every pod in a pod group is ranked.
A controller falls back to its original behavior if not all pods
in a pod group are ranked after a specified timeout (back-off simulation).

**A strategy requiring re-computation after the pod group size changes**:
- a controller will not scale down by two replicas (only by one)
- a controller will not scale down by one until pod ranks are re-computed after the previous scale-down operation

### Graduation Criteria

- Alpha: Initial support for taking the pod cost into account when scaling down in controllers. Disabled by default.
- Beta: Enabled by default.

### Upgrade / Downgrade Strategy

Scaling down based on a pod cost is optional. If no cost is present, scaling down falls back to the original behavior.

### Version Skew Strategy

A controller either recognizes a pod's cost or it does not.

## Implementation History

- KEP Started on 06/30/2020

## Alternatives [optional]

- Controllers might use a webhook and talk to the component directly to select a victim
- Some controllers might improve their decision logic to cover specific use cases (e.g. introduce a new policy for sorting pods based on information located in pod objects)
