# Granular Resource Limits in Node Autoscalers

## Objective

Node Autoscalers should allow setting more granular resource limits that would
apply to arbitrary subsets of nodes, beyond the existing limiting mechanisms.

## Background

Cluster Autoscaler supports cluster-wide limits on resources (like total CPU and
memory) and per-node-group node count limits. Karpenter supports
setting [resource limits on a NodePool](https://karpenter.sh/docs/concepts/nodepools/#speclimits),
but, as mentioned in
the [AWS docs](https://docs.aws.amazon.com/eks/latest/best-practices/karpenter.html),
it does not support cluster-wide limits. Neither mechanism is flexible enough
for many use cases.

Users often need to configure more granular limits. For instance, a user might
want to limit the total resources consumed by nodes of a specific machine
family, nodes with a particular OS, or nodes with specialized hardware like
GPUs. The current resource limit implementations in both node autoscalers do
not support these scenarios.

This proposal introduces a new API to extend the Node Autoscalers’
functionality, allowing limits to be applied to arbitrary sets of nodes.

## Proposal: The AutoscalingResourceQuota API

We propose a new Kubernetes custom resource, AutoscalingResourceQuota, to define
resource limits on specific subsets of nodes. Node subsets are targeted using
standard Kubernetes label selectors, offering a flexible way to group nodes.

A node's eligibility for a provisioning operation will be checked against all
AutoscalingResourceQuota objects that select it. The operation will only be
permitted if it does not violate any of the applicable limits. This is meant to
compose with the existing limiting mechanisms, i.e. CAS’ cluster-wide limits and
Karpenter’s NodePool limits: an operation that satisfies every
AutoscalingResourceQuota but violates one of the existing mechanisms is still
rejected.
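
As a rough illustration of this check, the sketch below (hypothetical Go types
and helper names, not part of the proposal) iterates over all quotas that select
a candidate node and rejects provisioning if adding the node's capacity to the
current usage would exceed any aggregate limit.

```go
// Sketch only: illustrates the intended quota check with hypothetical types.
package quota

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// Quota is a hypothetical in-memory view of an AutoscalingResourceQuota.
type Quota struct {
	Selector *metav1.LabelSelector // nil means the quota applies to all nodes
	Limits   corev1.ResourceList   // e.g. cpu: 64, memory: 256Gi
	Usage    corev1.ResourceList   // current sum over the selected nodes
}

// AllowsProvisioning reports whether adding a node with the given labels and
// capacity would stay within every quota that selects it. Existing mechanisms
// (CAS cluster-wide limits, Karpenter NodePool limits) are checked separately.
// Note: the "nodes" pseudo-resource would need special handling (count 1 per
// node); it is omitted here for brevity.
func AllowsProvisioning(quotas []Quota, nodeLabels map[string]string, capacity corev1.ResourceList) (bool, error) {
	for _, q := range quotas {
		sel := labels.Everything()
		if q.Selector != nil {
			var err error
			if sel, err = metav1.LabelSelectorAsSelector(q.Selector); err != nil {
				return false, err
			}
		}
		if !sel.Matches(labels.Set(nodeLabels)) {
			continue // this quota does not apply to the candidate node
		}
		for name, limit := range q.Limits {
			used := q.Usage[name] // zero if the resource is not tracked yet
			add := capacity[name] // zero if the node does not expose it
			total := used.DeepCopy()
			total.Add(add)
			if total.Cmp(limit) > 0 {
				return false, nil // provisioning would exceed this quota
			}
		}
	}
	return true, nil
}
```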

### API Specification

An AutoscalingResourceQuota object would look as follows:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: example-resource-quota
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 64
      memory: 256Gi
```

* `selector`: A standard Kubernetes label selector that determines which nodes
  the limits apply to. This allows for fine-grained control based on any label
  present on the nodes, such as zone, region, OS, machine family, or custom
  user-defined labels.
* `limits`: Defines limits on the summed-up resources of the selected nodes.

This approach is highly flexible – adding a new dimension for limits only
requires ensuring the nodes are labeled appropriately, with no code changes
needed in the autoscaler.

**Review discussion (the `AutoscalingResourceQuota` kind name):**

> **Reviewer:** I'd like to keep brainstorming this name. AutoscalingResourceQuota
> applies specifically to nodes, but autoscaling is a much broader domain (Pod
> Autoscaling, Volume Autoscaling, etc.). If I remember correctly, we shot down
> my preferred NodeResourceQuota in a previous doc, but I forget why. We had a
> similar naming debate
> on [the buffers proposal](https://github.com/kubernetes/autoscaler/blob/9e22656b32975da37aad7c937031cc8ab1796b91/cluster-autoscaler/proposals/buffers.md)
> (cc: @jbtk). Maybe there's a similar naming solution in CapacityResourceQuota,
> as we're more or less referring to `node.status.capacity`.
>
> **Author:** I don't think we shot it down; apparently the thread just got
> lost: [earlier discussion](https://docs.google.com/document/d/1ORj3oW2ZaciROAbTmqBG1agCmP_8B4BqmCNnQAmqmyc/edit?disco=AAABsMJ0d48).
> Let's continue the discussion here. Tl;dr from that thread, why I'm leaning
> more towards AutoscalingResourceQuota:
>
> * It's explicit that its scope is limited to autoscaling: the limits will be
>   respected only by node autoscalers, and users will still be able to register
>   nodes manually (side note: I guess it's up to you to decide how it's going to
>   work with Karpenter static node pools).
> * The name is more future-proof if we were to extend this resource to limit
>   DRA devices. With DRA support, NodeResourceQuota would be a bit inaccurate,
>   since devices != nodes.
>
> **Reviewer:** Do you think that CAPI might ever support this API? ResourceQuota
> is used more broadly than Pod Autoscaling; it's general purpose for anything
> applying pods (and other namespaced resources) to the cluster. The DRA argument
> leans me even further towards CapacityResourceQuota.

**Review discussion (the `selector` field name):**

> **Reviewer:** I have finally realized why you called it scopeSelector: it's
> because it's in ResourceQuota 🤦‍♂️.
>
> **Author:** Yeah, exactly, though @jonathan-innis noted that scopeSelector in
> ResourceQuota is a bit different, since it's not just a label selector but a
> named filter that you need to reference via a scopeName field. Here we just
> want to use a plain label selector, so scopeSelector might indeed not be an
> accurate name, and it's probably better to go with selector or nodeSelector
> (though the latter might be inaccurate if we were to support DRA).

**Review discussion (the `limits` field):**

> **Reviewer:** We may want to consider `hard` here to conceptually align
> with [ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/).
>
> **Author:** Especially if we agree on a hard/soft limits distinction. Let's see
> how the discussion goes and get back to this one.
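
For concreteness, here is a rough Go sketch of the types that could back this
CRD. The package name, JSON tags, and anything not spelled out in the YAML above
(including how the naming discussion above is resolved) are illustrative
assumptions rather than a settled API.

```go
// Sketch only: illustrative Go types mirroring the YAML example above.
package v1beta1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// AutoscalingResourceQuota limits the total resources of a selected set of nodes.
type AutoscalingResourceQuota struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   AutoscalingResourceQuotaSpec   `json:"spec"`
	Status AutoscalingResourceQuotaStatus `json:"status,omitempty"`
}

type AutoscalingResourceQuotaSpec struct {
	// Selector picks the nodes the limits apply to. An absent selector would
	// select all nodes (cluster-wide limits).
	Selector *metav1.LabelSelector `json:"selector,omitempty"`
	// Limits caps the summed-up resources of the selected nodes.
	Limits AutoscalingResourceLimits `json:"limits"`
}

type AutoscalingResourceLimits struct {
	// Resources maps resource names (cpu, memory, nvidia.com/gpu, or the
	// "nodes" pseudo-resource) to their maximum aggregate quantity.
	Resources corev1.ResourceList `json:"resources"`
}

type AutoscalingResourceQuotaStatus struct {
	// Usage is the current aggregate of the selected nodes' resources.
	Usage corev1.ResourceList `json:"usage,omitempty"`
}
```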

### Node as a Resource

The AutoscalingResourceQuota API can be naturally extended to treat the number
of nodes itself as a limitable resource, as shown in one of the examples below.

### AutoscalingResourceQuota Status

For better observability, the AutoscalingResourceQuota resource could be
enhanced with a status field. This field, updated by a controller, would display
the current resource usage of the selected nodes, allowing users to quickly
check usage against the defined limits via `kubectl describe`. The controller
could run as a separate thread within the node autoscaler component.

An example of the status field:

```yaml
status:
  usage:
    cpu: 32
    memory: 128Gi
    nodes: 50
```
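
A minimal sketch of how such a controller could compute this usage, assuming the
same illustrative types as above: it sums the capacity of every node matched by
the quota's selector and reports the node count under the `nodes`
pseudo-resource used in the examples below.

```go
// Sketch only: computes the aggregate usage reported in the status field.
package quota

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// ComputeUsage sums the capacity of all nodes matched by the quota's selector
// and adds the node count under the "nodes" pseudo-resource.
func ComputeUsage(selector *metav1.LabelSelector, nodes []corev1.Node) (corev1.ResourceList, error) {
	sel := labels.Everything() // no selector: the quota covers all nodes
	if selector != nil {
		var err error
		if sel, err = metav1.LabelSelectorAsSelector(selector); err != nil {
			return nil, err
		}
	}
	usage := corev1.ResourceList{}
	var count int64
	for _, node := range nodes {
		if !sel.Matches(labels.Set(node.Labels)) {
			continue
		}
		count++
		for name, quantity := range node.Status.Capacity {
			sum := usage[name]
			sum.Add(quantity)
			usage[name] = sum
		}
	}
	usage["nodes"] = *resource.NewQuantity(count, resource.DecimalSI)
	return usage, nil
}
```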

## Alternatives considered

### Minimum limits support

In addition to maximum limits, the initial design also included minimum limits.
Minimum limits were supposed to affect node consolidation in the node
autoscalers. A consolidation would be allowed only if removing the node wouldn’t
violate any minimum limits. Cluster-wide minimum limits are implemented in CAS
together with the maximum limits, so at first, it seemed logical to include both
limit directions in the design.

Despite being conceptually similar, minimum and maximum limits cover completely
different use cases. Maximum limits can be used to control the cloud provider
costs, to limit scaling certain types of compute, or to control distribution of
compute resources between teams working on the same cluster. Minimum limits’
main use case is ensuring a baseline capacity for users’ workloads, for example
to handle sudden spikes in traffic. However, minimum limits defined as a minimum
amount of resources in the cluster or a subset of nodes do not guarantee that
the workloads will be schedulable on those resources. For example, two nodes
with 2 CPUs each satisfy the minimum limit of 4 CPUs. If a user created a
workload requesting 2 CPUs, that workload would not fit into existing nodes,
making the baseline capacity effectively useless. This scenario will be better
handled by
the [CapacityBuffer API](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/buffers.md),
which allows the user to provide an exact shape of their workloads, including
the resource requests. In our example, the user would create a CapacityBuffer
with a pod template requesting 2 CPUs. Such a CapacityBuffer would ensure that a
pod with that shape is always schedulable on the existing nodes.

Therefore, we decided to remove minimum limits from the design of granular
limits, as CapacityBuffers are a better way to provide a baseline capacity for
user workloads.

### Kubernetes LimitRange and ResourceQuota

It has been discussed whether the same result could be accomplished by using the
standard Kubernetes
resources: [LimitRange](https://kubernetes.io/docs/concepts/policy/limit-range/)
and [ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/).

LimitRange is a resource used to configure minimum and maximum resource
constraints for a namespace. For example, it can define the default CPU and
memory requests for pods and containers within a namespace, or enforce a minimum
and maximum CPU request for a pod. However, its scope is limited to individual
objects: it doesn’t aggregate over all pods in the namespace, it only checks
whether each pod’s requests and limits are within the defined bounds.

ResourceQuota allows defining and limiting the aggregate resource consumption per
namespace. This includes limiting the total CPU, memory, and storage that all
pods and persistent volume claims within a namespace can request or consume. It
also supports limiting the count of various Kubernetes objects, such as pods,
services, and replication controllers. While resource quotas can be used to
limit the resources provisioned by the CA to some degree, it’s not possible to
guarantee that CA won’t scale up above the defined limit. Since the quotas
operate on pod requests, and CA does not guarantee that bin packing will yield
the optimal result, setting the quota to e.g. 64 CPUs does not mean that CA will
stop scaling at 64 CPUs.

Moreover, both of those resources are namespaced, so their scope is limited to
the namespace in which they are defined, while nodes are cluster-scoped. We
can’t use namespaced resources to limit the creation and deletion of
cluster-scoped resources.

## User Stories

### Story 1

As a cluster administrator, I want to configure cluster-wide resource limits to
avoid excessive cloud provider costs.

**Note:** This is already supported in CAS, but not in Karpenter.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: cluster-wide-limits
spec:
  limits:
    resources:
      cpu: 128
      memory: 256Gi
```

### Story 2

As a cluster administrator, I want to configure separate resource limits for
specific groups of nodes on top of cluster-wide limits, to avoid a situation
where one group of nodes starves others of resources.

**Note:** A specific group of nodes can be either a NodePool in Karpenter, a
ComputeClass in GKE, or simply a set of nodes grouped by a user-defined label.
This can be useful e.g. for organizations where multiple teams are running
workloads in a shared cluster, and these teams have separate sets of nodes. This
way, a cluster administrator can ensure that each team has a proper limit for
its resources and doesn’t starve other teams. This story is partly
supported by Karpenter’s NodePool limits.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: team-a-limits
spec:
  selector:
    matchLabels:
      team: a
  limits:
    resources:
      cpu: 32
```

### Story 3

As a cluster administrator, I want to allow scaling up machines that are more
expensive or less suitable for my workloads when better machines are
unavailable, but I want to limit how many of them can be created, so that I can
control extra cloud provider costs, or limit the impact of using non-optimal
machines for my workloads.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-e2-resources
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 32
      memory: 64Gi
```

### Story 4

As a cluster administrator, I want to limit the number of nodes in a specific
zone if my cluster is unbalanced for any reason, so that I can avoid exhausting
IP space in that zone, or enforce better balancing across zones.

**Note:** Originally requested
in [https://github.com/kubernetes/autoscaler/issues/6940](https://github.com/kubernetes/autoscaler/issues/6940).

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-nodes-us-central1-b
spec:
  selector:
    matchLabels:
      topology.kubernetes.io/zone: us-central1-b
  limits:
    resources:
      nodes: 64
```

**Review discussion (`nodes` vs `count/nodes`):**

> **Reviewer (@ellistarn, Nov 4, 2025):** Did you look at how ResourceQuotas do
> this? https://kubernetes.io/docs/concepts/policy/resource-quotas/#quota-on-object-count
>
>     spec:
>       hard:
>         cpu: 100
>         count/nodes: 3
>
> **Author:** Yes, I was thinking about it. The `count/<resource>` syntax counts
> the objects in the API server and is mainly used to protect against exhaustion
> of control plane storage. For some resources, ResourceQuota also supports a
> specialized syntax. For instance, you can set both `count/pods` and `pods`, and
> these two can mean different things (and have different use cases):
> `count/pods` counts all the pods in the namespace, whereas `pods` only counts
> pods in a non-terminal state. The former seems useful for limiting storage, and
> the latter handles the actual business logic, where you only care about active
> pods (for example due to pod IP space constraints).
>
> Following that logic, `nodes` seems more suitable to me than `count/nodes`, as
> we don't simply count Node objects in the API server. In the case of CAS,
> during scale-up operations we also want to count the upcoming nodes, i.e. nodes
> that were provisioned in previous loops but have not yet registered. In the
> case of Karpenter, you might want to count NodeClaims in a similar scenario.
> Also, in CAS we plan to allow cloud providers to have custom node filtering
> logic (as mentioned in #8702 (comment)), for example to filter out surge
> upgrade nodes in GKE. Other than that, I think it's just more convenient as an
> API. What do you think?
>
> **Reviewer:** Ah, I see the distinction between the specific `pods` and the
> generic `count/pods`. It seems that ResourceQuota evolved from gating node
> resources to generic Kubernetes resources. Interesting sleight of hand :). If
> we consider the API to be "node specific", which I think is fair, I could see
> making this field just `noderesourcequota.spec.hard.count = 3`. If it's not
> node specific, then something like `capacityresourcequota.spec.hard.nodes = 3`
> might make more sense. I'm not too fussed on the differences between those, but
> I agree that the semantic is different from the "storage limit" you raised. If
> we supported something like `count/nodes`, we might find ourselves operating on
> any global resource (`count/validatingadmissionpolicy`), which is definitely
> not the intent of this API.

### Story 5 (obsolete)

As a cluster administrator, I want to ensure there is always a baseline capacity
in my cluster or specific parts of my cluster below which the node autoscaler
won’t consolidate the nodes, so that my workloads can quickly react to sudden
spikes in traffic.

This user story is obsolete. The CapacityBuffer API covers this use case in a
more flexible way.

## Other AutoscalingResourceQuota examples

The following examples illustrate the flexibility of the proposed API and
demonstrate other possible use cases not described in the user stories.

#### Maximum Windows Nodes

Limit the total number of nodes running the Windows operating system to 8.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-windows-nodes
spec:
  selector:
    matchLabels:
      kubernetes.io/os: windows
  limits:
    resources:
      nodes: 8
```

#### Maximum NVIDIA T4 GPUs

Limit the total number of NVIDIA T4 GPUs in the cluster to 16.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-t4-gpus
spec:
  selector:
    matchLabels:
      example.cloud.com/gpu-type: nvidia-t4
  limits:
    resources:
      nvidia.com/gpu: 16
```

#### Cluster-wide Limits Excluding Control Plane Nodes

Apply cluster-wide CPU and memory limits while excluding nodes with the
control-plane role.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: cluster-limits-no-control-plane
spec:
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
  limits:
    resources:
      cpu: 64
      memory: 128Gi
```