|
7 | 7 | - [Goals](#goals) |
8 | 8 | - [Non-Goals](#non-goals) |
9 | 9 | - [Proposal](#proposal) |
10 | | - - [User Stories (Optional)](#user-stories-optional) |
| 10 | + - [User Stories](#user-stories) |
11 | 11 | - [Story 1](#story-1) |
12 | | - - [Story 2](#story-2) |
13 | | - - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) |
| 12 | + - [Notes/Constraints/Caveats](#notesconstraintscaveats) |
14 | 13 | - [Risks and Mitigations](#risks-and-mitigations) |
15 | 14 | - [Design Details](#design-details) |
16 | 15 | - [Test Plan](#test-plan) |
|
27 | 26 | - [Implementation History](#implementation-history) |
28 | 27 | - [Drawbacks](#drawbacks) |
29 | 28 | - [Alternatives](#alternatives) |
30 | | -- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) |
| 29 | + - [Make downscale heuristic an option](#make-downscale-heuristic-an-option) |
| 30 | + - [Compare pods using their distribution in the failure domains](#compare-pods-using-their-distribution-in-the-failure-domains) |
31 | 31 | <!-- /toc --> |
32 | 32 |
|
33 | 33 | ## Release Signoff Checklist |
@@ -96,22 +96,11 @@ and how a randomized approach solves the issue. |
96 | 96 | This story shows an imbalance cycle after a failure domain fails or gets |
97 | 97 | upgraded. |
98 | 98 |
|
99 | | -1. Assume a ReplicaSet has 3N pods evenly distributed across 3 failure domains, |
| 99 | +1. Assume a ReplicaSet has 2N pods evenly distributed across 2 failure domains, |
100 | 100 | thus each has N pods. |
101 | | -2. A failure or an upgrade happens in one of the domains. The N pods from this |
102 | | - domain get re-scheduled into the other 2 domains. Note that this N pods are |
103 | | - now the youngest. |
104 | | -3. The domain recovers or finishes upgrading. |
105 | | -4. ReplicaSet is downscaled to 2N, due to user action or HPA recommendation. |
106 | | - Given the downscaling algorithm, 2 domains end up with N nodes each, the 2N |
107 | | - Pods that were never restarted, and the remaining domain has 0 Pods. |
108 | | - There is nothing to be done here. A random approach would obtain the same |
109 | | - result. |
110 | | -5. The ReplicaSet is upscaled to 3N again, due to user action or HPA |
111 | | - recommendation. Due to Pod spreading during scheduling, each domain has N |
112 | | - Pods. Balance is recovered. However, one failure domain holds the youngest |
113 | | - Pods. |
114 | | -6. ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all |
| 101 | +2. An upgrade happens adding a new available domain and the ReplicaSet is upscaled |
| 102 | + to 3N. The new domain now holds all the youngest pods due to scheduler spreading. |
| 103 | +3. ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all |
115 | 104 | the Pods from one domain are removed, leading to imbalance. The situation |
116 | 105 | doesn't improve with repeated upscale and downscale steps. Instead, a |
117 | 106 | randomized approach leaves about 2/3*N Pods in each |
@@ -155,14 +144,16 @@ there are a number of reasons why we don't need to preserve such behavior as is: |
155 | 144 | We propose a randomized approach to the algorithm for Pod victim selection |
156 | 145 | during ReplicaSet downscale: |
157 | 146 |
|
158 | | -1. Do a random shuffle of ReplicaSet Pods. |
| 147 | +1. Sort ReplicaSet pods by pod UUID. |
159 | 148 | 2. Obtain wall time, and add it to [`ActivePodsWithRanks`](https://github.com/kubernetes/kubernetes/blob/dc39ab2417bfddcec37be4011131c59921fdbe98/pkg/controller/controller_utils.go#L815) |
160 | 149 | 2. Call sorting algorithm with a modified time comparison for start and |
161 | 150 | creation timestamp. |
162 | 151 |
|
| 152 | + |
163 | 153 | Instead of directly comparing timestamps, the algorithm compares the elapsed |
164 | | -times since the timestamp until the current time but in a logarithmic scale, |
165 | | -floor rounded. This has the effect of treating elapsed times as equals when they |
| 154 | +times since the creation and ready timestamps until the current time but in a |
| 155 | +logarithmic scale, floor rounded. These serve as sorting criteria. |
| 156 | +This has the effect of treating elapsed times as equals when they |
166 | 157 | have the same scale. That is, Pods that have been running for a few nanoseconds |
167 | 158 | are equal, but they are different from pods that have been running for a few |
168 | 159 | seconds or a few days. |
|