Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
382 changes: 382 additions & 0 deletions keps/sig-scheduling/20200630-cost-based-scaling-down-of-pods.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,382 @@
---
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you are using the old template. Please update

title: Cost based scaling down of pods
authors:
- "@ingvagabund"
owning-sig: sig-scheduling
participating-sigs:
- sig-apps
reviewers:
- TBD
approvers:
- TBD
editor: TBD
creation-date: 2020-06-30
last-updated: 2020-07-20
status: provisional
---

# Cost based scaling down of pods

## Table of Contents

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Examples of strategies](#examples-of-strategies)
- [Balancing duplicates among topological domains](#balancing-duplicates-among-topological-domains)
- [Pods not tolerating taints first](#pods-not-tolerating-taints-first)
- [Minimizing pod anti-affinity](#minimizing-pod-anti-affinity)
- [Rank normalization and weighted sum](#rank-normalization-and-weighted-sum)
- [User Stories [optional]](#user-stories-optional)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Story 3](#story-3)
- [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
- [Phases](#phases)
- [Option A (field in a pod status)](#option-a-field-in-a-pod-status)
- [Option B (CRD for a pod group)](#option-b-crd-for-a-pod-group)
- [Workflow example](#workflow-example)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Implementation History](#implementation-history)
- [Alternatives [optional]](#alternatives-optional)
<!-- /toc -->

## Summary

Cost ranking pods through an external component and scaling down pods based
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try to be more explicit on what the current state is vs what is being proposed.

on the cost allows to employ various scheduling strategies to keep a cluster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

descheduling/eviction strategies?

from diverging from an optimal distribution of resources.
Providing an external solution for selecting the right victim allows to improve ability
to preserve various conditions such us balancing pods among failure domains, keeping
aligned with security requirements or respecting application policies.
Allowing controllers to be free of any scheduling strategy, yet to be aware
of impact of removing pods on the overall cluster scheduling plan, helps to reduce
cost of re-scheduling resources.

## Motivation

Scaling down a set of pods does not always results in optimal selection of victims.
The scheduler relies on filters and scores which may distribute the pods wrt. topology
spreading and/or load balancing constraints (e.g. pods uniformly balanced among zones).
Application specific workloads may prefer to scale down short-running pods and favor long-running pods.
Selecting a victim with a trivial logic can unbalance the topology spreading
or have jobs that accumulated work to be lost in vain.
Given it's a natural property of a cluster to shift workloads in time,
decision made by a scheduler at some time is as good as its ability to predict future demands.
The default kubernetes scheduler was constructed with a goal to provide high throughput
at the cost of being simple. Thus, it is quite easy to diverge from the scheduling plan.
In contrast, descheduler allows to help to re-balance the plan and get closer to
the scheduler constraints. Yet, it is designed to run and adjust the cluster periodically (e.g. each hour).
Therefore, unusable for scaling down purposes (which require immediate action).

On the other hand each controller with a scale down operation has its own
implementation of a victim selection logic.
The decision making logic does not take into account a scheduling plan.
Extending each such controller with additional logic to support various scheduling
constraints is impractical. In cases a proprietary solution for scaling down is required,
it's impossible. Also, controllers do not necessarily have a whole cluster overview
so its decision does not have to be optimal.
Therefore, it's more feasible to locate the logic outside of a controller.

In order to support more informed scaling down operation while keeping scheduling plan in mind,
additional decision logic that can be extended based on applications requirements is needed.

### Goals

- Controllers with scale down operation are allowed to select a victim while still respecting a scheduling plan
- External component is available that can rank pods based on how much they diverge from a scheduling plan when deleted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be goals for the KEP. Is the plan to just provide the base binary and the framework or is it also to provide strategies A, B and C.


### Non-Goals

- Allow to employ strategies that require cost re-computation after scaling up/down (with a support from controllers, e.g. backing-off)

## Proposal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the high level, I think this is a useful feature for various types of workloads (batch or serving) and placement strategies (spreading, binpacking etc.).


Proposed solution is to implement an optional cost-based component that will be watching all
pods (or its subset) and nodes (potentially other objects) present in a cluster.
Assigning each pod a cost based on a set of scheduling constraints.
At the same time extending controllers logic to utilize the pod cost when selecting a victim during scale down operation.

The component will allow to select a different list of scheduling constraints for each targeted
set of pods. Each pod in a set will be given a cost based on how much important it is in the set.
The constraints can follow the same rules as the scheduler (through importing scheduling plugins)
or be custom made (e.g. wrt. to application or proprietary requirements).
The component will implement a mechanism for ranking pods.
<!-- Either by annotating a pod, updating its status, setting a new field in pod's spec
or creating a new CRD which will carry a cost. -->
Each controller will have a choice to either ignore the cost or take it into account
when scaling down.

This way, the logic for selecting a victim for the scaling down operation will be
separated from each controller. Allowing each consumer to provide its own
logic for assigning costs. Yet, having all controllers to consume the cost uniformly.

Given the default scheduler is not a source of truth about how a pod should be distributed
after it was scheduled, scaling down strategies can exercise completely different approaches.

Examples of scheduling constraints:
- choose pods running on a node which have a `PreferNoSchedule` taint first
- choose youngest/oldest pods first
- choose pods minimizing topology skew among failure domains (e.g. availability zones)
Comment on lines +125 to +128
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean the component will provide different sets of "cost" according to different strategies?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, each strategy will provide a different set of costs for a pod group. If configured and implemented properly, suitable weighted sum of ranks can give a final rank as is done in scheduler when scoring multiple nodes today.


The goal of the proposal is not to provide specific strategies for more informed scaling down operation.
The primary goal is to provide a mechanism and have controllers implement the mechanism.
Allowing consumers of the new component to define their own strategies.

### Examples of strategies

Strategies can be divided into two categories:
- scaling down/up a pod group does not require rank re-computation
- scaling down/up a pod group requires rank re-computation

#### Balancing duplicates among topological domains

- Evict pods while minimizing skew between topological domains
- Each pod can be given a cost based on how old/young it is in the same domain:
- if a pod is the first one in the domain, rank the pod with cost `1`
- if a pod was created second to the domain, rank the pod with cost `2`
- continue this way until all pods in all domains are ranked
- higher rank of a pod, the sooner the pod gets removed

#### Pods not tolerating taints first

- Evict pods that do not tolerate taints before pods that tolerate taints.
- Each pod can be given a cost based on how many taints are not tolerated
- higher rank of a pod, the sooner the pod gets removed

#### Minimizing pod anti-affinity

- Evict pods maximizing anti-affinity first
- Pod that improves anti-affinity on a node gets higher rank
- Given multiple pod groups can be part of anti-affinity group, scaling down
a single pod in a group requires re-computation of pod ranks of all pods
taking part. Also, only a single pod can be scaled down at a time.
Otherwise, ranks no longer have to provide optimal victim selection.

In the provided examples the first two strategies do not require rank re-computation.

### Rank normalization and weighted sum
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting that multiple controllers could influence the cost of evicting a pod?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make the assumption that the cost of a pod is updated by a single controller, the implementation detail of that controller is not relevant in this KEP I think.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the assumption that the cost of a pod is updated by a single controller

Checking something. In this proposal the cost of Pod eviction is part of its actual state (and not the .spec for the Pod). Is that right?

I'd expect eviction cost to be calculated by one thing using lots of plugins (much like how scoring works for scheduling), and that one thing would form part of the control plane.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking something. In this proposal the cost of Pod eviction is part of its actual state (and not the .spec for the Pod). Is that right?

Yes, it would be part of pod.Status.

I'd expect eviction cost to be calculated by one thing using lots of plugins (much like how scoring works for scheduling), and that one thing would form part of the control plane.

I feel that the implementation detail of the controller updating the cost is not important here. I don't think this KEP needs to take a stand or promote implementing a controller in k/k. Such a controller warrants its own KEP if we feel we need one in-tree.


In order to allow pod ranking by multiple strategies/constraints, it's important
to normalize ranks. On the other hand, rank normalization requires all strategies
to re-compute all ranks every time a pod is created/deleted. To eliminate the need
to re-compute, each strategy can introduce a threshold where every pod rank
exceeding the threshold gets rounded to the threshold.
E.g. if a topology domain has at least 10 pods, 11-th and other pods get the same
rank as 10-th pod.
With the threshold based normalization multiple strategies can rank a pod group
which can be used to compute weighted rank through all relevant strategies.

### User Stories [optional]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@ingvagabund ingvagabund Jul 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubernetes/kubernetes#89922

Thanks!!!

kubernetes/kubernetes#88809

This one is a use case for the descheduler directly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we going to serve the different stories? Different people want different strategies?

Is it going to be a configuration for the controller or something that Pods can define?

If that's something we don't want to discuss in this KEP, it should be added to Non-goal.


#### Story 1

From [@pnovotnak](https://github.com/kubernetes/kubernetes/issues/4301#issuecomment-328685358):

```
I have a number of scientific programs that I've wrapped with code to talk
to a message broker that do not checkpoint state. The cost of deleting the resource
increases over time (some of these tasks take hours), until it completes the current unit of work.

Choosing a pod by most idle resources would also work in my case.
```

#### Story 2

From [@cpwood](https://github.com/kubernetes/kubernetes/issues/4301#issuecomment-436587548)

```
For my use case, I'd prefer Kubernetes to choose its victims from pods that are running on nodes which have a PreferNoSchedule taint.
```

#### Story 3

From [@barucoh](https://github.com/kubernetes/kubernetes/issues/89922)

```
A deployment with 3 replicas with anti-affinity rule to spread across 2 AZs scaled down to 2 replicas in only 1 AZ.
```

### Implementation Details/Notes/Constraints
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should center the discussion here around descheduler. What we need to propose is an abstract concept of eviction cost, with the cost being comparable among the pods that belong to the same group/collection (have the same owner ref for example)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the intent of the KEP, it should be specified in Goals/Non-Goals


Currently, the descheduler does not allow to immediately react on changes in a cluster.
Yet, with some modification another instance of the descheduler (with different set of strategies)
might be ran in watch mode and rank each pod as it comes.
Also, once the scheduling framework gets migrated into its own repository,
scheduling plugins can be vendored as well to provide some of the core scheduling logic.

The pod ranking is best-effort so in case a controller is to delete more than one pod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO "Delete more than one pod" would be a common req, "selects all the pods with the smallest cost and remove" may flip and imbalance situation easily. So is it possible to ensure the "cost" design to support "Delete X pods in a batch"? No need to be perfect in the beginning, but keep the batch-deletion req in mind from day 1 is necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a few words on this topic. There are two categories of strategies defined. One category where there's no need to recompute costs. So it's safe to scale down with 2 or more replicas at the same time. Other category where each time a pod is added/removed, all the costs need to be re-computed. I am mainly focusing on the first category to provide basic functionality. With a space to discuss the second category.

it selects all the pods with the highest cost and remove those.
In case a pod fails to be deleted during the scale down operation and results in resuming the operation in the next cycle,
it may happen pods get ranked differently and a different set of victim pods gets selected.

Once a pod is removed, ranks of others pods might be required to get re-computed.
Unless strategies that do not require re-computation are deployed.
By default, all pods owned by a controller template has to be ranked.
Otherwise, a controller falls back to each original selection victim logic.
Resp. it can be configured to wait or back-off.
Also, the ranking strategies can be configured to target only selected sets of pods.
Thus, allowing a controller to employ cost based selection only when more sophisticated
logic is required and available.

During alpha phase, each controller utilizing the pod ranking will feature gate the new logic.
Starting by utilizing a pod annotation (e.g. `scheduling.alpha.kubernetes.io/cost`)
which can be eventually promoted to either a field in pod's spec or moved under CRD (see further).

If strategies requiring rank re-computation are employed, it's more practical to define
a CRD for a pod group and have all the costs in a single place to avoid desynchronization
of ranks among pods.

#### Phases

Phase 1:
- add support for strategies which do not need rank re-computation of a pod group
- only a single strategy can be ran to rank pods (unless threshold based normalization is applied)
- use annotations to hold a single pod cost

Phase 2A:
- promote pod cost annotation to a pod status field
- no synchronization of pods in a pod group, harder support of strategies which require rank re-computation

Phase 2B:
- use a CRD to hold costs of all pods in a pod group (to synchronize re-computation of ranks)
- add support for strategies which require rank re-computation

### Option A (field in a pod status)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this option, I don't think we need more than that, this is a simple and abstract concept will be easy to reason about.


Store a post cost/rank under pod's status so it can be updated only by component
who has permission to update pod status.

```go
// PodStatus represents information about the status of a pod. Status may trail the actual
// state of a system, especially if the node that hosts the pod cannot contact the control
// plane.
type PodStatus struct {
...
// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-cost
// +optional
Cost int `json:"cost,omitempty" protobuf:"bytes,...,opt,name=cost"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine by default the controller (e.g., the ReplicaSet controller) that creates the pods would set this to the highest conceptual value for new pods because we assume the scheduler will make the best placement possible. Another controller would be responsible for updating the cost to something lower at runtime based on cluster state.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it feasible to have the scheduler record its actual opinion of placement, or to leave room for adding that?

For example: a three-zone cluster, even pod spreading, scaling from 2 to 3. The scheduler places the Pod in the zone with no replica and scores this as “I could not have done better”.

Now schedule 3 more Pods. The first two go leaving this.

  • node a (zone 1): 1 DaemonSet Pod, 2 app Pods
  • node b (zone 2): 1 DaemonSet Pod, 1 app Pod, 1 large cluster critical Pod
  • node c (zone 3): 1 DaemonSet Pod, 2 app Pods

The scheduler considers the 3rd Pod and sees it doesn't fit on Node b. The scheduler picks node c. The scheduler also records that the scheduling involved a weighted trade-off that is scored at (say) 0.72, because the Pods for the app are not evenly spread.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the word "cost" very confusing for the intend of this proposal. I think "Penalty" would be more appropriate. Then, that makes it more clear that a 0 penalty means less chance of being evicted.

...
}
```

Very simple, reading the field directly from a pod status.
No additional apimachinery logic.

### Option B (CRD for a pod group)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this CRD adding value compared to option 1.


```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroupCost
metadata:
name: rc-guestbook-fronted
namespace: rc-guestbook-fronted-namespace
spec:
owner:
kind: ReplicationController
name: rc-guestbook-fronted // may be redundant
costs:
"rc-guestbook-fronted-pod1": 4
"rc-guestbook-fronted-pod2": 8
...
"rc-guestbook-fronted-podn": 2
```

More suitable for keeping all pod costs from a pod group in sync.
Controllers will need to take into account the new CRD (adding informers).
A CR will live in the same namespace as underlying pod group (RC, RC, etc.).

### Workflow example

**Scenario**: pod group of 12 pods, 3 AZs (2 nodes per each AZ), pods are evenly spread among all zones

1. Assuming a pod group is supposed to respect topology spreading and scale down
operation is to minimize topology skew between domains.
1. **Ranking component**: The component is configured to rank pods based on their presence in a topology domain
1. **Ranking component**: The component notices the pods, analyzes the pod group and ranks the pods in the following manner (`{PodName: Rank}`):
- AZ1: {P1: 1, P2: 2, P3: 3, P4: 4} (P1 getting 1 as it was created first in the domain, P2 getting 2, etc.)
- AZ2: {P5: 1, P6: 2, P7: 3, P8: 4}
- AZ3: {P9: 1, P10: 2, P11: 3, P12: 4}
1. **A controller**: Scale down operation of the pod group is requested
1. **A controller**: Scale down logic of a controller selects one of P4, P8 or P12 as a victim (e.g. P8)
1. Topology skew is now `1`
1. **Ranking component**: No need to re-compute ranks since the ranking does not depend on the pod group size
1. **A controller**: Scaling down one more time selects one of {P4, P12}
1. Topology skew is still `1`

### Risks and Mitigations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you separate the risks with a numbered list?


It may happen the ranking component does not rank all relevant pods in time.
In that case a controller can either choose to ignore the cost. Or, it can back-off
with a configurable timeout and retry the scale down operation once all pods in
a given set are ranked.

From the security perspective a malicious code might assign pod a different cost
with a goal to remove more vital pods to harm a running application.
How much is using annotation safe? Might be better to use pod status
so only clients with pod/status update RBAC are allowed to change the cost.

In case a strategy needs to re-compute costs after scale down operation and
the component stops working (for any reason), a controller might scale down
incorrect pod(s) in the next request. More reasons to constraint strategies
to not need to re-compute pod costs.

In case a scaling down process is too quick, the component may be too slow to
recompute all scores and provide suboptimal/incorrect costs.

In case two or more controllers own a pod group (through labels), scaling down the group by one
controller can result in scale up the same group by another controller.
Entering an endless loop of scaling up and down. Which may result in unexpected
behavior. Leaving a subgroup of pods unranked.

Deployment upgrades might have different expectations when exercising a rolling update.
They could just completely ignore the costs. Unless, it's acceptable to scale down by one
and wait until the costs are recomputed when needed.

## Design Details

### Test Plan

**Scaling down respects pod ranking**:
In the simplest case the component ranks pods in a group.
The goal is to validate all pods are scaled down in an order
respecting ranks of all pods in a pod group.

**A controller ignores ranks if at least one pod is missing a rank**:
Testing the case where not every pod in a pod group is ranked.
A controller falls down to its original behavior if not all pods
in a pod group are ranked after specified timeout (back-off simulation).

**In case a strategy requiring re-computation after a pod group size changed**:
- a controller will not scale down by two (only by one) replicas
- a controller will not scale down by one until pod ranks are re-computed after previous scale down operation

### Graduation Criteria

- Alpha: Initial support for taking pod cost into account when scaling down in controllers. Disabled by default.
- Beta: Enabled by default

### Upgrade / Downgrade Strategy

Scaling down based on a pod cost is optional. If no cost is present, scaling down falls back to the original behavior.

### Version Skew Strategy

A controller either recognizes pod's cost or it does not.

## Implementation History

- KEP Started on 06/30/2020

## Alternatives [optional]

- Controllers might use a webhook and talk to the component directly to select a victim
- Some controllers might improve their decision logic to cover specific use cases (e.g. introduce new policy for sorting pods based on information located in pod objects)