(squash) feedback

damemi · damemi · commit 16d8044962bc · 2021-01-14T10:37:34.000-05:00
diff --git a/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md b/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md
@@ -7,10 +7,9 @@
   - [Goals](#goals)
   - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
-  - [User Stories (Optional)](#user-stories-optional)
+  - [User Stories](#user-stories)
     - [Story 1](#story-1)
-    - [Story 2](#story-2)
-  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
   - [Test Plan](#test-plan)
@@ -27,7 +26,8 @@
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
-- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+  - [Make downscale heuristic an option](#make-downscale-heuristic-an-option)
+  - [Compare pods using their distribution in the failure domains](#compare-pods-using-their-distribution-in-the-failure-domains)
 <!-- /toc -->
 
 ## Release Signoff Checklist
@@ -96,22 +96,11 @@ and how a randomized approach solves the issue.
 This story shows an imbalance cycle after a failure domain fails or gets
 upgraded.
 
-1. Assume a ReplicaSet has 3N pods evenly distributed across 3 failure domains,
+1. Assume a ReplicaSet has 2N pods evenly distributed across 2 failure domains,
    thus each has N pods.
-2. A failure or an upgrade happens in one of the domains. The N pods from this
-   domain get re-scheduled into the other 2 domains. Note that this N pods are
-   now the youngest.
-3. The domain recovers or finishes upgrading.
-4. ReplicaSet is downscaled to 2N, due to user action or HPA recommendation.
-   Given the downscaling algorithm, 2 domains end up with N nodes each, the 2N
-   Pods that were never restarted, and the remaining domain has 0 Pods.
-   There is nothing to be done here. A random approach would obtain the same
-   result.
-5. The ReplicaSet is upscaled to 3N again, due to user action or HPA
-   recommendation. Due to Pod spreading during scheduling, each domain has N
-   Pods. Balance is recovered. However, one failure domain holds the youngest
-   Pods.
-6. ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all
+2. An upgrade adds a new available domain, and the ReplicaSet is upscaled to 3N.
+   Due to scheduler spreading, the new domain now holds all the youngest pods.
+3. ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all
    the Pods from one domain are removed, leading to imbalance. The situation
    doesn't improve with repeated upscale and downscale steps. Instead, a
    randomized approach leaves about 2/3*N nodes in each
@@ -155,14 +144,16 @@ there are a number of reasons why we don't need to preserve such behavior as is:
 We propose a randomized approach to the algorithm for Pod victim selection
 during ReplicaSet downscale:
 
-1. Do a random shuffle of ReplicaSet Pods.
+1. Sort ReplicaSet pods by pod UUID.
 2. Obtain wall time, and add it to [`ActivePodsWithRanks`](https://github.com/kubernetes/kubernetes/blob/dc39ab2417bfddcec37be4011131c59921fdbe98/pkg/controller/controller_utils.go#L815)
 2. Call sorting algorithm with a modified time comparison for start and
    creation timestamp.
 
+
 Instead of directly comparing timestamps, the algorithm compares the elapsed
-times since the timestamp until the current time but in a logarithmic scale,
-floor rounded. This has the effect of treating elapsed times as equals when they
+times from the creation and ready timestamps to the current time on a
+logarithmic scale, floor rounded. These rounded values serve as sorting criteria.
+This has the effect of treating elapsed times as equals when they
 have the same scale. That is, Pods that have been running for a few nanoseconds
 are equal, but they are different from pods that have been running for a few
 seconds or a few days.
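
Reviewer note: the logarithmic comparison described in this hunk can be sketched in Go as below. This is a minimal illustration under stated assumptions, not the KEP's implementation: the `pod` type, `logRank`, `sortVictims`, and `demoOrder` are hypothetical names, base 2 is an assumed logarithm base (the text does not fix one), and only the creation timestamp is compared, whereas the proposal also covers the ready timestamp. Pods whose elapsed times fall into the same floor-rounded bucket are treated as equal and fall back to UID order.

```go
package main

import (
	"fmt"
	"math/bits"
	"sort"
	"time"
)

// pod is a hypothetical stand-in for the fields this sketch needs.
type pod struct {
	UID     string
	Created time.Time
}

// logRank returns floor(log2(elapsed in nanoseconds)).
// Base 2 is an assumption; non-positive durations map to -1.
func logRank(elapsed time.Duration) int {
	if elapsed <= 0 {
		return -1
	}
	return bits.Len64(uint64(elapsed)) - 1
}

// sortVictims orders pods so that those in the youngest logarithmic
// bucket come first; pods in the same bucket are ordered by UID.
func sortVictims(pods []pod, now time.Time) {
	sort.Slice(pods, func(i, j int) bool {
		ri := logRank(now.Sub(pods[i].Created))
		rj := logRank(now.Sub(pods[j].Created))
		if ri != rj {
			return ri < rj
		}
		return pods[i].UID < pods[j].UID
	})
}

// demoOrder builds four pods relative to a fixed wall time and
// returns their UIDs in victim order.
func demoOrder() []string {
	now := time.Date(2021, 1, 14, 12, 0, 0, 0, time.UTC)
	pods := []pod{
		{UID: "a", Created: now.Add(-10 * 24 * time.Hour)},
		{UID: "b", Created: now.Add(-9 * 24 * time.Hour)},
		{UID: "c", Created: now.Add(-3 * time.Second)},
		{UID: "d", Created: now.Add(-2500 * time.Millisecond)},
	}
	sortVictims(pods, now)
	order := make([]string, len(pods))
	for i, p := range pods {
		order[i] = p.UID
	}
	return order
}

func main() {
	fmt.Println(demoOrder()) // [c d a b]
}
```

A strict youngest-first comparison would order these pods d, c, b, a. With the bucketed comparison, c and d share one bucket and a and b share another, so the order inside each bucket no longer depends on small age differences.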