- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.6k
Add KEP for random ReplicaSet downscale #2233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @alculquicondor @ahg-g @kubernetes/sig-apps-feature-requests
2522b5f    to
    0ed4d54      
    Compare
  
    | Don't mark this as "fixes". The issue remains open until graduation to GA. | 
| @alculquicondor thanks, updated | 
| - Unit and e2e tests | ||
|  | ||
| Beta (v1.22): | ||
| - Enable RandomReplicaSetDownscale feature gate by default | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have some user feedback criteria.
Would silent consensus be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asking for user feedback is probably a good idea, but otherwise I think no news is good news
        
          
                keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md
              
                Outdated
          
            Show resolved
            Hide resolved
        
      0ed4d54    to
    c1bd2f7      
    Compare
  
    | For example, let's assume the base 10 is used, then we have the following | ||
| mapping for different durations: | ||
|  | ||
| | Duration | Scale | | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps clarify that scale translates to the pod rank.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rank as defined in the code? It doesn't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not literally as defined in the code, but that it translates to the order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not. It just changes how timestamps are compared, which is the last criteria for Pod comparisons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, may be I am being too brief since I am commenting from the phone (you also need to give me the benefit of the doubt :)), what I meant is that this is the new sorting key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully this gets explained with the other comment.
It can be seen as a sorting key, but saying that would create confusion, as it is not the only criteria.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are multiple sorting keys, not just one.
| creation timestamp. | ||
|  | ||
| Instead of directly comparing timestamps, the algorithm compares the elapsed | ||
| times since the timestamp until the current time but in a logarithmic scale, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"since the timestamp"
Which timestamp? creation timestamp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are 2: creation and ready
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we have two sorting keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by this now too, I thought this was just referring to creationTimestamp (picked this part up from Aldo's draft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Diving into implementation details here:
There is not a single sorting key. There are sorting criteria
https://github.com/kubernetes/kubernetes/blob/cac9339/pkg/controller/controller_utils.go#L784-L809
Two of these criteria are ready time and creation time. We should affect both.
Feel free to include some of these details in the KEP, @damemi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, so criteria 5 and 7 each will be updated so that the key for each is not the timestamp, but the "scale".
| We propose a randomized approach to the algorithm for Pod victim selection | ||
| during ReplicaSet downscale: | ||
|  | ||
| 1. Do a random shuffle of ReplicaSet Pods. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if just quick sort, with the random pivot, provides the guarantees that any order is equally likely.
If it doesn't, then we would favor removing Pods in the order provided by the lister.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the last sorting key can be a random number or the pods' uuid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uid sounds good. The random number would have to be produced before calling sort (and not in the Less function) otherwise it leads to undefined behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the random numbers need to be assigned before hand, hence the uuid suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to sort the pods before or after grouping them into their scale buckets? I thought the goal was to select randomly from the youngest bucket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal is that Pods that belong to the same scale bucket are in a random order.
Maybe we don't need to get into the implementation detail of how we achive that in the KEP, but what it's important to note is what is the precedence of the buckets with regard to the other sorting criteria https://github.com/kubernetes/kubernetes/blob/cac9339/pkg/controller/controller_utils.go#L784-L809
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal is that Pods that belong to the same scale bucket are in a random order.
Maybe we don't need to get into the implementation detail
Agree, just making sure I understood the intent.
|  | ||
| ### User Stories | ||
|  | ||
| #### Story 1 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
focusing on the upgrade story is more convincing in my opinion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, this story could probably just be shortened to the last 2 steps (5,6): where a new domain is added and upscaled to 3N, containing all the youngest pods. Then downscaled to 2N and all the pods from the new domain are removed. Is that what you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to just focus on an upgrade, please check my wording :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if that's what Abdullah meant, but thanks for the simplification.
For completion, I would say:
A deployment could become imbalanced after:
- An entire failure domain fails and becomes unavailable
- A failure domain gets teared down for maintenance or upgrade
- A new failure domain is added to a cluster
The imbalance cycle goes like follows:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds reasonable, but the upgrade case I had in mind is not necessarily about adding a new failure domain, but about a rolling upgrade of nodes: consider a region with three zones, and the nodes get upgraded one zone at a time. As aldo mentioned, you can say that all those cases can lead to imbalance.
| Note that this proposal is complementary to #1828 | 
c1bd2f7    to
    db57fbd      
    Compare
  
    | Tried to address the feedback so far in a new commit (that I intend to squash) | 
074f9c7    to
    16d8044      
    Compare
  
    | status: implementable | ||
| creation-date: 2020-12-15 | ||
| reviewers: | ||
| - "@janetkuo" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add me as a reviewer.
|  | ||
| 1. Sort ReplicaSet pods by pod UUID. | ||
| 2. Obtain wall time, and add it to [`ActivePodsWithRanks`](https://github.com/kubernetes/kubernetes/blob/dc39ab2417bfddcec37be4011131c59921fdbe98/pkg/controller/controller_utils.go#L815) | ||
| 2. Call sorting algorithm with a modified time comparison for start and | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider algorithm which does 2 as described above but still includes node affinity like it's currently implemented. But instead of the initial sort extend the algorithm such that it will randomly pick from a bucket if len(bucket) > 1, which would limit the amount of sorting needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional problem I'm seeing here is that since you're modifying start and creation timestamp you'll end up copying the pod resources which will increase memory consumption during this phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not copying the timestamp. We just compare it differently.
You can see the proof-of-concept implementation kubernetes/kubernetes#96898
|  | ||
| 1. Sort ReplicaSet pods by pod UUID. | ||
| 2. Obtain wall time, and add it to [`ActivePodsWithRanks`](https://github.com/kubernetes/kubernetes/blob/dc39ab2417bfddcec37be4011131c59921fdbe98/pkg/controller/controller_utils.go#L815) | ||
| 2. Call sorting algorithm with a modified time comparison for start and | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not copying the timestamp. We just compare it differently.
You can see the proof-of-concept implementation kubernetes/kubernetes#96898
| We propose a randomized approach to the algorithm for Pod victim selection | ||
| during ReplicaSet downscale: | ||
|  | ||
| 1. Sort ReplicaSet pods by pod UUID. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You include say the purpose of this: to obtain a pseudo-random shuffle.
Also, you don't have to make it as a first step. It can just be another comparison criteria for ActivePodsWithRnaks.Less after comparing timestamps.
83bf8bf    to
    fb205f5      
    Compare
  
    | Updated to add @soltysh to reviewers. Was there anything else I missed? | 
fb205f5    to
    d891ea6      
    Compare
  
    | Bumping this for @kubernetes/sig-apps-pr-reviews as we approach KEP freeze | 
| /assign @wojtek-t | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PRR itself looks fine, but I would like to see SIG approval first.
        
          
                keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md
              
                Outdated
          
            Show resolved
            Hide resolved
        
      Add summary, motivation, detailed design and alternatives. Signed-off-by: Aldo Culquicondor <[email protected]>
d891ea6    to
    123b9d6      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nits.
/lgtm
|  | ||
| ### Risks and Mitigations | ||
|  | ||
| Certain users might be relaying in the existing downscaling heuristic. However, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Certain users might be relaying in the existing downscaling heuristic. However, | |
| Certain users might be relaying on the existing downscaling heuristic. However, | 
| /approve | 
| /approve for Alpha PRR (will require more for for beta) | 
| [APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: damemi, kow3ns, soltysh, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here 
Needs approval from an approver in each of these files:
 
 Approvers can indicate their approval by writing  | 
This introduces the KEP for random replicaset downscale (as previously discussed in kubernetes/kubernetes#96748). Currently addresses sections for alpha designation, looking for feedback from sig-apps and sig-scheduling
re: #2185
/sig apps
/sig scheduling