Granular resource limits proposal #8702
# Granular Resource Limits in Node Autoscalers

## Objective

Node Autoscalers should allow setting more granular resource limits that would
apply to arbitrary subsets of nodes, beyond the existing limiting mechanisms.
## Background

Cluster Autoscaler supports cluster-wide limits on resources (like total CPU and
memory) and per-node-group node count limits. Karpenter supports
setting [resource limits on a NodePool](https://karpenter.sh/docs/concepts/nodepools/#speclimits).
However, as mentioned
in the [AWS docs](https://docs.aws.amazon.com/eks/latest/best-practices/karpenter.html),
Karpenter does not support cluster-wide limits. This is not flexible enough for
many use cases.

Users often need to configure more granular limits. For instance, a user might
want to limit the total resources consumed by nodes of a specific machine
family, nodes with a particular OS, or nodes with specialized hardware like
GPUs. The current resource limits implementations in both node autoscalers do
not support these scenarios.

This proposal introduces a new API to extend the Node Autoscalers’
functionality, allowing limits to be applied to arbitrary sets of nodes.
## Proposal: The AutoscalingResourceQuota API

We propose a new Kubernetes custom resource, AutoscalingResourceQuota, to define
resource limits on specific subsets of nodes. Node subsets are targeted using
standard Kubernetes label selectors, offering a flexible way to group nodes.

A node's eligibility for a provisioning operation will be checked against all
AutoscalingResourceQuota objects that select it. The operation will only be
permitted if it does not violate any of the applicable limits. This should be
compatible with the existing limiting mechanisms, i.e. CAS’ cluster-wide limits
and Karpenter’s NodePool limits. Therefore, if an operation does not violate any
AutoscalingResourceQuota but does violate one of the existing limiting
mechanisms, it should still be rejected.
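To make the intended semantics concrete, the following is a minimal sketch of
how an autoscaler could run this check before provisioning a node. The `Quota`
type, the `PermitsNode` function, and the idea of keeping a precomputed `Usage`
per quota are illustrative assumptions rather than part of the proposed API;
only the label-selector and limit semantics come from this proposal.

```go
package quota

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// Quota is an illustrative in-memory view of one AutoscalingResourceQuota.
type Quota struct {
	Selector *metav1.LabelSelector
	Limits   corev1.ResourceList // spec.limits.resources
	Usage    corev1.ResourceList // summed resources of the nodes it currently selects
}

// PermitsNode reports whether provisioning a node with the given labels and
// resources would keep every quota that selects the node within its limits.
// Existing mechanisms (CAS cluster-wide limits, Karpenter NodePool limits) are
// checked separately and can still reject the operation.
func PermitsNode(quotas []Quota, nodeLabels map[string]string, nodeResources corev1.ResourceList) (bool, error) {
	for _, q := range quotas {
		sel := labels.Everything() // no selector: the quota applies to all nodes
		if q.Selector != nil {
			var err error
			sel, err = metav1.LabelSelectorAsSelector(q.Selector)
			if err != nil {
				return false, err
			}
		}
		if !sel.Matches(labels.Set(nodeLabels)) {
			continue // this quota does not select the candidate node
		}
		for name, limit := range q.Limits {
			used := q.Usage[name]
			used.Add(nodeResources[name])
			if used.Cmp(limit) > 0 {
				return false, nil // adding the node would exceed this quota
			}
		}
	}
	return true, nil
}
```

In this sketch, quotas whose selectors do not match the candidate node are
simply skipped, which is what makes the limits granular: each quota constrains
only the subset of nodes its selector matches.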
### API Specification

An AutoscalingResourceQuota object would look as follows:
```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: example-resource-quota
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 64
      memory: 256Gi
```

**Inline review note** (on `limits`): We may want to consider […] Especially if
we agree on a hard/soft limits distinction. Let's see how the discussion goes
and get back to this one.
* `selector`: A standard Kubernetes label selector that determines which nodes
  the limits apply to. This allows for fine-grained control based on any label
  present on the nodes, such as zone, region, OS, machine family, or custom
  user-defined labels.
* `limits`: Defines the limits on the summed-up resources of the selected nodes.

This approach is highly flexible: adding a new dimension for limits only
requires ensuring the nodes are labeled appropriately, with no code changes
needed in the autoscaler.
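For readers who prefer to see the shape of the API in Go, below is a rough
sketch of what the corresponding types could look like, reusing
`metav1.LabelSelector` and `corev1.ResourceList` from the standard API
machinery. The type and field names mirror the YAML above plus the status field
described later, but they are an illustrative assumption rather than a
finalized API definition.

```go
package v1beta1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// AutoscalingResourceQuota limits the summed-up resources of an arbitrary,
// label-selected subset of nodes that a node autoscaler may provision.
type AutoscalingResourceQuota struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   AutoscalingResourceQuotaSpec   `json:"spec"`
	Status AutoscalingResourceQuotaStatus `json:"status,omitempty"`
}

// AutoscalingResourceQuotaSpec selects a set of nodes and caps their resources.
type AutoscalingResourceQuotaSpec struct {
	// Selector picks the nodes the limits apply to. An absent selector
	// would mean the quota applies to all nodes in the cluster.
	Selector *metav1.LabelSelector `json:"selector,omitempty"`

	// Limits caps the total resources of the selected nodes.
	Limits AutoscalingResourceQuotaLimits `json:"limits"`
}

// AutoscalingResourceQuotaLimits holds the per-resource caps.
type AutoscalingResourceQuotaLimits struct {
	// Resources maps a resource name (cpu, memory, nvidia.com/gpu, nodes, ...)
	// to the maximum total amount allowed across the selected nodes.
	Resources corev1.ResourceList `json:"resources"`
}

// AutoscalingResourceQuotaStatus reports current usage for observability.
type AutoscalingResourceQuotaStatus struct {
	// Usage is the current sum of resources of the selected nodes,
	// maintained by a controller (see the status section below).
	Usage corev1.ResourceList `json:"usage,omitempty"`
}
```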
### Node as a Resource

The AutoscalingResourceQuota API can be naturally extended to treat the number
of nodes itself as a limitable resource, as shown in one of the examples below.
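One simple way to implement this, sketched below under the assumption that
usage is accumulated node by node, is to add a synthetic `nodes: 1` entry to
each node's contribution so that node counts are limited exactly like any other
resource. The `nodeContribution` helper name and the use of
`node.status.capacity` are illustrative choices, not mandated by the proposal.

```go
package quota

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// nodeContribution returns the resources a single node adds to a quota's
// usage, counting the node itself as one unit of a synthetic "nodes"
// resource so that node counts can be capped like CPU or memory.
func nodeContribution(node *corev1.Node) corev1.ResourceList {
	out := corev1.ResourceList{}
	for name, qty := range node.Status.Capacity {
		out[name] = qty.DeepCopy()
	}
	out["nodes"] = resource.MustParse("1")
	return out
}
```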
### AutoscalingResourceQuota Status

For better observability, the AutoscalingResourceQuota resource could be
enhanced with a status field. This field, updated by a controller, would display
the current resource usage for the selected nodes, allowing users to quickly
check usage against the defined limits via `kubectl describe`. The controller
can run in a separate thread as a part of the node autoscaler component.

An example of the status field:

```yaml
status:
  usage:
    cpu: 32
    memory: 128Gi
    nodes: 50
```
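As a rough illustration of what such a controller could do on each sync, the
helper below sums the contributions of the nodes matched by a quota's selector,
reusing the illustrative `nodeContribution` helper from the previous sketch.
The function name and signature are assumptions; a real implementation would
read nodes from an informer cache and write the result back to the object's
`status.usage`.

```go
package quota

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// computeUsage sums the resources of all nodes matched by the quota's
// selector. A controller would publish the result as status.usage.
func computeUsage(selector *metav1.LabelSelector, nodes []*corev1.Node) (corev1.ResourceList, error) {
	sel := labels.Everything() // no selector: the quota covers every node
	if selector != nil {
		var err error
		sel, err = metav1.LabelSelectorAsSelector(selector)
		if err != nil {
			return nil, err
		}
	}
	usage := corev1.ResourceList{}
	for _, node := range nodes {
		if !sel.Matches(labels.Set(node.Labels)) {
			continue
		}
		for name, qty := range nodeContribution(node) {
			sum := usage[name]
			sum.Add(qty)
			usage[name] = sum
		}
	}
	return usage, nil
}
```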
## Alternatives considered

### Minimum limits support

The initial design, besides the maximum limits, also included minimum limits.
Minimum limits were supposed to affect node consolidation in the node
autoscalers. A consolidation would be allowed only if removing the node wouldn’t
violate any minimum limits. Cluster-wide minimum limits are implemented in CAS
together with the maximum limits, so at first, it seemed logical to include both
limit directions in the design.

Despite being conceptually similar, minimum and maximum limits cover completely
different use cases. Maximum limits can be used to control cloud provider
costs, to limit scaling of certain types of compute, or to control the
distribution of compute resources between teams working on the same cluster.
Minimum limits’ main use case is ensuring a baseline capacity for users’
workloads, for example to handle sudden spikes in traffic. However, minimum
limits defined as a minimum amount of resources in the cluster or a subset of
nodes do not guarantee that the workloads will be schedulable on those
resources. For example, two nodes with 2 CPUs each satisfy a minimum limit of
4 CPUs. If a user created a workload requesting 2 CPUs, that workload would not
fit into the existing nodes, making the baseline capacity effectively useless.
This scenario will be better handled by
the [CapacityBuffer API](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/buffers.md),
which allows the user to provide the exact shape of their workloads, including
the resource requests. In our example, the user would create a CapacityBuffer
with a pod template requesting 2 CPUs. Such a CapacityBuffer would ensure that a
pod with that shape is always schedulable on the existing nodes.

Therefore, we decided to remove minimum limits from the design of granular
limits, as CapacityBuffers are a better way to provide a baseline capacity for
user workloads.
### Kubernetes LimitRange and ResourceQuota

It has been discussed whether the same result could be accomplished by using the
standard Kubernetes
resources: [LimitRange](https://kubernetes.io/docs/concepts/policy/limit-range/)
and [ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/).

LimitRange is a resource used to configure minimum and maximum resource
constraints for a namespace. For example, it can define the default CPU and
memory requests for pods and containers within a namespace, or enforce a minimum
and maximum CPU request for a pod. However, its scope is limited to a single
object: it doesn’t look at all pods in the namespace, but only checks whether
each pod’s requests and limits are within the defined bounds.

ResourceQuota allows defining and limiting the aggregate resource consumption
per namespace. This includes limiting the total CPU, memory, and storage that
all pods and persistent volume claims within a namespace can request or consume.
It also supports limiting the count of various Kubernetes objects, such as pods,
services, and replication controllers. While resource quotas can be used to
limit the resources provisioned by the CA to some degree, it’s not possible to
guarantee that CA won’t scale up above the defined limit. Since the quotas
operate on pod requests, and CA does not guarantee that bin packing will yield
the optimal result, setting the quota to e.g. 64 CPUs does not mean that CA will
stop scaling at 64 CPUs.

Moreover, both of those resources are namespaced, so their scope is limited to
the namespace in which they are defined, while nodes are cluster-scoped. We
can’t use namespaced resources to limit the creation and deletion of
cluster-scoped resources.
## User Stories

### Story 1

As a cluster administrator, I want to configure cluster-wide resource limits to
avoid excessive cloud provider costs.

**Note:** This is already supported in CAS, but not in Karpenter.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: cluster-wide-limits
spec:
  limits:
    resources:
      cpu: 128
      memory: 256Gi
```
### Story 2

As a cluster administrator, I want to configure separate resource limits for
specific groups of nodes on top of cluster-wide limits, to avoid a situation
where one group of nodes starves others of resources.

**Note:** A specific group of nodes can be either a NodePool in Karpenter, a
ComputeClass in GKE, or simply a set of nodes grouped by a user-defined label.
This can be useful e.g. for organizations where multiple teams are running
workloads in a shared cluster, and these teams have separate sets of nodes. This
way, a cluster administrator can ensure that each team has a proper limit for
its resources and does not starve the other teams. This story is partly
supported by Karpenter’s NodePool limits.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: team-a-limits
spec:
  selector:
    matchLabels:
      team: a
  limits:
    resources:
      cpu: 32
```
### Story 3

As a cluster administrator, I want to allow scaling up machines that are more
expensive or less suitable for my workloads when better machines are
unavailable, but I want to limit how many of them can be created, so that I can
control extra cloud provider costs, or limit the impact of using non-optimal
machines for my workloads.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-e2-resources
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 32
      memory: 64Gi
```
### Story 4

As a cluster administrator, I want to limit the number of nodes in a specific
zone if my cluster is unbalanced for any reason, so that I can avoid exhausting
IP space in that zone, or enforce better balancing across zones.

**Note:** Originally requested
in [https://github.com/kubernetes/autoscaler/issues/6940](https://github.com/kubernetes/autoscaler/issues/6940).

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-nodes-us-central1-b
spec:
  selector:
    matchLabels:
      topology.kubernetes.io/zone: us-central1-b
  limits:
    resources:
      nodes: 64
```

**Inline review discussion** (on the `nodes` limit):

* Did you look at how ResourceQuotas do this?
  https://kubernetes.io/docs/concepts/policy/resource-quotas/#quota-on-object-count
* Yes, I was thinking about it. Following that logic, […] Other than that, I
  think it's just more convenient as an API. What do you think?
* Ah, I see the distinction vs the specific […] If we consider the API to be
  "node specific", which I think is fair, I could see making this field just […]
  Not too fussed on the differences between those, but I agree that the semantic
  is different than the "storage limit" you raised. If we supported something
  like […]
### Story 5 (obsolete)

As a cluster administrator, I want to ensure there is always a baseline capacity
in my cluster or specific parts of my cluster below which the node autoscaler
won’t consolidate the nodes, so that my workloads can quickly react to sudden
spikes in traffic.

This user story is obsolete. CapacityBuffer API covers this use case in a more
flexible way.

## Other AutoscalingResourceQuota examples

The following examples illustrate the flexibility of the proposed API and
demonstrate other possible use cases not described in the user stories.
#### **Maximum Windows Nodes**

Limit the total number of nodes running the Windows operating system to 8.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-windows-nodes
spec:
  selector:
    matchLabels:
      kubernetes.io/os: windows
  limits:
    resources:
      nodes: 8
```
#### **Maximum NVIDIA T4 GPUs**

Limit the total number of NVIDIA T4 GPUs in the cluster to 16.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-t4-gpus
spec:
  selector:
    matchLabels:
      example.cloud.com/gpu-type: nvidia-t4
  limits:
    resources:
      nvidia.com/gpu: 16
```
#### **Cluster-wide Limits Excluding Control Plane Nodes**

Apply cluster-wide CPU and memory limits while excluding nodes with the
control-plane role.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: cluster-limits-no-control-plane
spec:
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
  limits:
    resources:
      cpu: 64
      memory: 128Gi
```
**Review discussion on the resource name:**

* I'd like to keep brainstorming this name. `AutoscalingResourceQuota` applies
  specifically to nodes, but Autoscaling is a much broader domain (i.e. Pod
  Autoscaling, Volume Autoscaling, etc.). If I remember correctly, we shot down
  my preferred `NodeResourceQuota` in a previous doc, but I forget why. We had a
  similar naming debate on
  https://github.com/kubernetes/autoscaler/blob/9e22656b32975da37aad7c937031cc8ab1796b91/cluster-autoscaler/proposals/buffers.md
  (cc: @jbtk). Maybe there's a similar naming solution in
  `CapacityResourceQuota`, as we're more or less referring to
  `node.status.capacity`.
* I don't think we shot it down, apparently the thread just got lost:
  https://docs.google.com/document/d/1ORj3oW2ZaciROAbTmqBG1agCmP_8B4BqmCNnQAmqmyc/edit?disco=AAABsMJ0d48
  Let's continue the discussion here. Tl;dr from the thread, why I'm leaning
  more towards `AutoscalingResourceQuota`: `NodeResourceQuota` would be a bit
  inaccurate, since devices != nodes.
* Do you think that CAPI might ever support this API? The `ResourceQuota` is
  used more broadly than Pod Autoscaling; it's general purpose for anything
  applying pods (and other namespaced resources) to the cluster. This leans me
  even further towards `CapacityResourceQuota`.