
Conversation

@norbertcyran
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a proposal for support of granular resource limits in node autoscalers.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot added the release-note-none, kind/feature, cncf-cla: yes, do-not-merge/needs-area, and needs-ok-to-test labels on Oct 28, 2025
@k8s-ci-robot
Contributor

Hi @norbertcyran. Thanks for your PR.

I'm waiting for a kubernetes org member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/L (100-499 lines changed, ignoring generated files) label on Oct 28, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: norbertcyran
Once this PR has been reviewed and has the lgtm label, please assign x13n for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@norbertcyran
Contributor Author

This proposal was initially discussed within SIG Autoscaling in this doc: https://docs.google.com/document/d/1ORj3oW2ZaciROAbTmqBG1agCmP_8B4BqmCNnQAmqmyc/edit?usp=sharing

@ellistarn
Contributor

FYI -- there is a karpenter specific proposal for this feature: https://github.com/kubernetes-sigs/karpenter/pull/2525/files#diff-5eac97882a24e1c56d7ac0dc9cd56c6c5d7ca182f5e1344bfe644eee898a5132R23. One of the key challenges mentioned is the "launch before terminate" challenge when doing upgrades and gracefully rolling capacity. This can cause you to get stuck if you're at your limits.

Thoughts on expanding this proposal to reason about this case?

@norbertcyran
Contributor Author

FYI -- there is a karpenter specific proposal for this feature: https://github.com/kubernetes-sigs/karpenter/pull/2525/files#diff-5eac97882a24e1c56d7ac0dc9cd56c6c5d7ca182f5e1344bfe644eee898a5132R23. One of the key challenges mentioned is the "launch before terminate" challenge when doing upgrades and gracefully rolling capacity. This can cause you to get stuck if you're at your limits.

Thoughts on expanding this proposal to reason about this case?

From the perspective of the current state of CAS, differentiating between hard and soft limits does not make much sense, as there's no equivalent of Karpenter's node replacement. Scale down only consolidates a node if all the pods running on it can be scheduled on other nodes that already exist in the cluster. Therefore, functionality-wise, there would be no difference between hard and soft limits. However, such a solution would be more future-proof.

FWIW, in GKE we have surge upgrades: https://docs.cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies#surge. As they are not handled by CAS, they bypass resource limits, but additionally we don't want the surge nodes to be counted towards the limits at all (i.e. if both the old node and the surge node exist, we count them as one), so they don't block scale-ups. To handle that, in the CAS resource limits implementation we plan to have a NodeFilter interface that allows filtering certain nodes out of the usage calculation: https://github.com/kubernetes/autoscaler/pull/8662/files#diff-03d8f6b8cba668e6137329c4df0e8a979be244ea7c7465a4d8f10ca08849eb5aR12-R14, https://github.com/kubernetes/autoscaler/pull/8662/files#diff-03d8f6b8cba668e6137329c4df0e8a979be244ea7c7465a4d8f10ca08849eb5aR35-R37
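
To make the idea concrete, here is a minimal Go sketch of what such a filter hook could look like. The names (NodeFilter, Include, usedCPU) are illustrative only; the actual interface is the one in the linked PR.

package limits

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// NodeFilter decides whether a node counts towards resource usage.
// A surge-upgrade filter would return false for surge nodes that
// duplicate an old node, so they never block scale-ups against limits.
type NodeFilter interface {
	Include(node *corev1.Node) bool
}

// usedCPU sums the allocatable CPU of all nodes that pass the filter.
// A real implementation would track every limited resource, not just CPU.
func usedCPU(nodes []*corev1.Node, filter NodeFilter) resource.Quantity {
	var total resource.Quantity
	for _, node := range nodes {
		if filter != nil && !filter.Include(node) {
			continue // e.g. a surge node counted together with the node it replaces
		}
		total.Add(*node.Status.Allocatable.Cpu())
	}
	return total
}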

As an alternative to soft and hard limits, I was thinking that something like NodeFilter could be used to bypass the limits during graceful consolidation, but that doesn't seem perfect either:

  • some users might expect that the autoscaler never exceeds the limits, even temporarily
  • some users might be constrained by external factors, such as IPs or licensing (as mentioned in the doc you linked), so exceeding the limits might simply not be possible
  • users can't control by how much the limits are exceeded; a user might want to specify something like "I want to allow exceeding the limit temporarily during consolidation, but only by x nodes", where x could be defined as the difference between the soft and hard limits

That said, I can see the potential usefulness of having both soft and hard limits, even if it makes the API slightly more complicated.

Would you suggest having an API like that?

...
limits:
  soft:
    resources:
      nodes: 8
  hard:
    resources:
      nodes: 12
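
For illustration, a rough Go sketch of how an autoscaler could evaluate limits shaped like the YAML above (hypothetical helper names, not the proposed implementation): during node replacement the soft limit may be exceeded by up to hard - soft nodes, while the hard limit is never crossed.

package limits

// NodeLimits mirrors the YAML sketch above: the soft limit may be exceeded
// temporarily (e.g. while consolidation replaces a node), the hard limit never.
type NodeLimits struct {
	SoftNodes int
	HardNodes int
}

// canAddNode reports whether one more node may be created, given the current
// node count and whether the scale-up replaces a node being consolidated.
func canAddNode(current int, limits NodeLimits, replacingNode bool) bool {
	if current+1 > limits.HardNodes {
		return false // the hard limit is never exceeded
	}
	if replacingNode {
		// During replacement the soft limit may be exceeded by at most
		// HardNodes - SoftNodes nodes, so only the hard limit applies.
		return true
	}
	return current+1 <= limits.SoftNodes
}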

@towca @x13n do you have any thoughts about that?

@ellistarn
Contributor

  • some users might expect that the autoscaler never exceeds the limits, even temporarily
  • some users might be constrained by external factors, such as IPs or licensing (as mentioned in the doc you linked), so exceeding the limits might simply not be possible
  • users can't control by how much the limits are exceeded; a user might want to specify something like "I want to allow exceeding the limit temporarily during consolidation, but only by x nodes", where x could be defined as the difference between the soft and hard limits

Agree with these. Personally, I am not sure I am convinced by the soft/hard proposal. As you mention, there are other factors (like IPs) that impose limits as well. I suspect we will be forced to "terminate-before-launch" when at those limits (within PDBs).
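
As a rough illustration of that fallback (hypothetical names, not Karpenter's actual code): prefer launch-before-terminate while there is headroom under the hard limit, and fall back to terminate-before-launch, within PDBs, when there isn't.

package limits

// RollStrategy describes how a node gets replaced during a graceful roll.
type RollStrategy int

const (
	LaunchBeforeTerminate RollStrategy = iota // preferred: no capacity gap
	TerminateBeforeLaunch                     // fallback when the limit leaves no headroom
)

// pickRollStrategy prefers launch-before-terminate, but falls back to
// terminate-before-launch when the limit (node count, IPs, licenses, ...)
// does not allow a temporary extra node.
func pickRollStrategy(currentNodes, hardLimit int) RollStrategy {
	if currentNodes < hardLimit {
		return LaunchBeforeTerminate
	}
	return TerminateBeforeLaunch
}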

I lean towards a solution that treats limits as best-effort, where there are cases (e.g. surge updates) in which the limits can temporarily be exceeded. I'm not sure what the state of the art is today, but when we first released limits in Karpenter, they were best-effort due to launch parallelism.

Mostly, I wanted to highlight this problem and get you in touch with the karpenter folks who are thinking about this (@maxcao13, @jmdeal, @jonathan-innis). I want to avoid Karpenter's limit semantics diverging from SIG Autoscaling API standards unless absolutely necessary.

@maxcao13
Member

Thanks for the ping.

Yeah, I have a similar proposal for Karpenter itself, since it does not support node limits for non-static capacity: kubernetes-sigs/karpenter#2525

I'm proposing soft/hard limits in a similar way. The API fields are slightly different because of backwards-compatibility concerns, but I generally agree with the semantics where soft limits can be temporarily exceeded, while hard limits definitively constrain nodes from ever going over.

If we can agree on one semantic across both proposals, that would be ideal. For now, I don't see a reason why they would need to differ.

