Commit 16d8044

(squash) feedback
1 parent db57fbd commit 16d8044

File tree

  • keps/sig-apps/2185-random-pod-select-on-replicaset-downscale

1 file changed: 13 additions, 22 deletions

keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md
@@ -7,10 +7,9 @@
   - [Goals](#goals)
   - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
-  - [User Stories (Optional)](#user-stories-optional)
+  - [User Stories](#user-stories)
     - [Story 1](#story-1)
-    - [Story 2](#story-2)
-  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
   - [Test Plan](#test-plan)
@@ -27,7 +26,8 @@
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
-- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+  - [Make downscale heuristic an option](#make-downscale-heuristic-an-option)
+  - [Compare pods using their distribution in the failure domains](#compare-pods-using-their-distribution-in-the-failure-domains)
 <!-- /toc -->
 
 ## Release Signoff Checklist
@@ -96,22 +96,11 @@ and how a randomized approach solves the issue.
 This story shows an imbalance cycle after a failure domain fails or gets
 upgraded.
 
-1. Assume a ReplicaSet has 3N pods evenly distributed across 3 failure domains,
+1. Assume a ReplicaSet has 2N pods evenly distributed across 2 failure domains,
    thus each has N pods.
-2. A failure or an upgrade happens in one of the domains. The N pods from this
-   domain get re-scheduled into the other 2 domains. Note that this N pods are
-   now the youngest.
-3. The domain recovers or finishes upgrading.
-4. ReplicaSet is downscaled to 2N, due to user action or HPA recommendation.
-   Given the downscaling algorithm, 2 domains end up with N nodes each, the 2N
-   Pods that were never restarted, and the remaining domain has 0 Pods.
-   There is nothing to be done here. A random approach would obtain the same
-   result.
-5. The ReplicaSet is upscaled to 3N again, due to user action or HPA
-   recommendation. Due to Pod spreading during scheduling, each domain has N
-   Pods. Balance is recovered. However, one failure domain holds the youngest
-   Pods.
-6. ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all
+2. An upgrade happens adding a new available domain and the ReplicaSet is upscaled
+   to 3N. The new domain now holds all the youngest pods due to scheduler spreading.
+3. ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all
    the Pods from one domain are removed, leading to imbalance. The situation
    doesn't improve with repeated upscale and downscale steps. Instead, a
    randomized approach leaves about 2/3*N nodes in each
@@ -155,14 +144,16 @@ there are a number of reasons why we don't need to preserve such behavior as is:
 We propose a randomized approach to the algorithm for Pod victim selection
 during ReplicaSet downscale:
 
-1. Do a random shuffle of ReplicaSet Pods.
+1. Sort ReplicaSet pods by pod UUID.
 2. Obtain wall time, and add it to [`ActivePodsWithRanks`](https://github.com/kubernetes/kubernetes/blob/dc39ab2417bfddcec37be4011131c59921fdbe98/pkg/controller/controller_utils.go#L815)
 3. Call sorting algorithm with a modified time comparison for start and
    creation timestamp.
 
+
 Instead of directly comparing timestamps, the algorithm compares the elapsed
-times since the timestamp until the current time but in a logarithmic scale,
-floor rounded. This has the effect of treating elapsed times as equals when they
+times since the creation and ready timestamps until the current time but in a
+logarithmic scale, floor rounded. These serve as sorting criteria.
+This has the effect of treating elapsed times as equals when they
 have the same scale. That is, Pods that have been running for a few nanoseconds
 are equal, but they are different from pods that have been running for a few
 seconds or a few days.
