@aleskandro
Member

@aleskandro aleskandro commented Jul 22, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

kubernetes-sigs/cluster-api#11962 introduced the nodeInfo field for MachineTemplates. Providers can reconcile this field in the status subresource to inform the autoscaler about the architecture and operating system that the MachineTemplate's nodes will run.

Previously, the cluster autoscaler inferred these values from the labels capacity annotation and, as a fallback, from default values set in environment variables at cluster-autoscaler deployment time.

With this commit, the cluster autoscaler determines the architecture of a future node from the following sources, in descending order of priority:

  • Labels set on existing nodes, when the node group is not scaling from zero
  • Labels set in the labels capacity annotation of the machine template, machine set, and machine deployment
  • Values in the status.nodeInfo of MachineTemplates
  • Generic/default labels set in the environment of the cluster autoscaler
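
For the scale-from-zero case, the last three sources in the list can be pictured as a left-to-right map merge in which later maps win. The following is a minimal sketch under that assumption, not the code from this PR: joinStringMaps and scaleFromZeroLabels are hypothetical names, and the label keys are the well-known kubernetes.io/arch and kubernetes.io/os node labels.

```go
package main

import "fmt"

// joinStringMaps is a hypothetical helper for illustration only.
// It merges the maps left to right, so a key in a later map overrides
// the same key in an earlier one. Ranging over a nil map is a no-op
// in Go, so nil arguments are accepted.
func joinStringMaps(items ...map[string]string) map[string]string {
	result := make(map[string]string)
	for _, m := range items {
		for k, v := range m {
			result[k] = v
		}
	}
	return result
}

// scaleFromZeroLabels orders its sources from lowest to highest priority:
// generic defaults, then values derived from status.nodeInfo, then the
// labels capacity annotation.
func scaleFromZeroLabels(generic, nodeInfo, capacityAnnotation map[string]string) map[string]string {
	return joinStringMaps(generic, nodeInfo, capacityAnnotation)
}

func main() {
	generic := map[string]string{"kubernetes.io/arch": "amd64", "kubernetes.io/os": "linux"}
	nodeInfo := map[string]string{"kubernetes.io/arch": "arm64"}

	// No capacity annotation is set, so nodeInfo overrides the generic default.
	labels := scaleFromZeroLabels(generic, nodeInfo, nil)
	fmt.Println(labels["kubernetes.io/arch"], labels["kubernetes.io/os"]) // prints: arm64 linux
}
```

Putting the lowest-priority source first keeps the precedence visible in the argument order of a single call.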

Which issue(s) this PR fixes:

Fixes #8347

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Implementors of Cluster API providers can now report the architecture and operating system of a future node in the status.nodeInfo field of their machine template, so that the cluster autoscaler can use these values when scaling from zero.
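
As a rough illustration of how such a status field could be turned into node labels, here is a minimal sketch that works on the decoded (unstructured) content of a machine template. The function name nodeInfoLabels and the field names architecture and operatingSystem under status.nodeInfo are assumptions made for this example. The label keys are the well-known kubernetes.io/arch and kubernetes.io/os.

```go
package main

import "fmt"

// nodeInfoLabels is a hypothetical helper. It reads status.nodeInfo from
// the decoded content of a machine template and maps it to node labels.
// The field names "architecture" and "operatingSystem" are assumptions.
// It returns nil when status.nodeInfo is absent or malformed.
func nodeInfoLabels(obj map[string]interface{}) map[string]string {
	status, ok := obj["status"].(map[string]interface{})
	if !ok {
		return nil
	}
	nodeInfo, ok := status["nodeInfo"].(map[string]interface{})
	if !ok {
		return nil
	}
	labels := make(map[string]string)
	if arch, ok := nodeInfo["architecture"].(string); ok && arch != "" {
		labels["kubernetes.io/arch"] = arch
	}
	if osName, ok := nodeInfo["operatingSystem"].(string); ok && osName != "" {
		labels["kubernetes.io/os"] = osName
	}
	return labels
}

func main() {
	template := map[string]interface{}{
		"status": map[string]interface{}{
			"nodeInfo": map[string]interface{}{
				"architecture":    "arm64",
				"operatingSystem": "linux",
			},
		},
	}
	labels := nodeInfoLabels(template)
	fmt.Println(labels["kubernetes.io/arch"], labels["kubernetes.io/os"]) // prints: arm64 linux
}
```

Returning nil when the field is missing keeps the caller simple, provided the merge step that consumes the result tolerates nil maps.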

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area area/cluster-autoscaler area/provider/cluster-api Issues or PRs related to Cluster API provider and removed do-not-merge/needs-area labels Jul 22, 2025
@k8s-ci-robot k8s-ci-robot requested review from elmiko and hardikdr July 22, 2025 13:00
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 22, 2025
@aleskandro aleskandro force-pushed the arch-aware-node-info branch 5 times, most recently from dd56e8b to 13dea52 Compare July 23, 2025 09:35
@aleskandro aleskandro force-pushed the arch-aware-node-info branch 2 times, most recently from 8ec180c to c3a296a Compare July 23, 2025 12:33
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 23, 2025
@aleskandro aleskandro changed the title from "WIP: Arch aware node info" to "WIP: Update cluster-api provider to use machineTemplate.status.nodeInfo for architecture-aware autoscale from zero" Jul 23, 2025
@aleskandro aleskandro force-pushed the arch-aware-node-info branch 4 times, most recently from 21705b1 to 9f183df Compare August 6, 2025 14:49
Contributor

@elmiko elmiko left a comment

on balance this is making sense, i think we need to clean up a couple things.

reducing the number of api server calls is important for the capi provider as we tend to be very noisy from an HTTP perspective.

@aleskandro aleskandro force-pushed the arch-aware-node-info branch 3 times, most recently from 7e356f2 to 8b78773 Compare August 7, 2025 07:25
Contributor

@elmiko elmiko left a comment

this mostly makes sense to me, i have a couple questions.

Contributor

@elmiko elmiko left a comment

/lgtm

@jackfrancis would you mind taking a look as well 🙏

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 4, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aleskandro
Once this PR has been reviewed and has the lgtm label, please ask for approval from elmiko. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 23, 2025
@aleskandro aleskandro force-pushed the arch-aware-node-info branch from 16cfe50 to 53b4e29 Compare October 3, 2025 08:26
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 3, 2025
@aleskandro aleskandro force-pushed the arch-aware-node-info branch 2 times, most recently from 5683ee1 to b27f5d6 Compare October 3, 2025 08:50
@jackfrancis
Contributor

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 3, 2025
@jackfrancis
Contributor

/lgtm
/assign @elmiko

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 3, 2025
Contributor

@elmiko elmiko left a comment

code looks good to me, apologies @aleskandro , i changed the test structure to make it less complex with a builder pattern. you'll need to update the unit tests and then probably add a WithNodeInfo function to the builder.
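
A WithNodeInfo builder method along those lines might look like the sketch below. Every name in it (testMachineTemplate, machineTemplateBuilder, newMachineTemplateBuilder, Build) is hypothetical and chosen only for illustration; the actual test builder in the repository may be shaped differently.

```go
package main

import "fmt"

// testMachineTemplate is a hypothetical, simplified test fixture.
type testMachineTemplate struct {
	Name            string
	Architecture    string
	OperatingSystem string
}

// machineTemplateBuilder is a hypothetical builder for that fixture.
type machineTemplateBuilder struct {
	tmpl testMachineTemplate
}

func newMachineTemplateBuilder(name string) *machineTemplateBuilder {
	return &machineTemplateBuilder{tmpl: testMachineTemplate{Name: name}}
}

// WithNodeInfo records the values a provider would report in
// status.nodeInfo and returns the builder so calls can be chained.
func (b *machineTemplateBuilder) WithNodeInfo(arch, operatingSystem string) *machineTemplateBuilder {
	b.tmpl.Architecture = arch
	b.tmpl.OperatingSystem = operatingSystem
	return b
}

// Build returns a copy of the configured fixture.
func (b *machineTemplateBuilder) Build() testMachineTemplate {
	return b.tmpl
}

func main() {
	tmpl := newMachineTemplateBuilder("arm-template").WithNodeInfo("arm64", "linux").Build()
	fmt.Println(tmpl.Name, tmpl.Architecture, tmpl.OperatingSystem) // prints: arm-template arm64 linux
}
```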

…r architecture-aware autoscale from zero

# Conflicts:
#	cluster-autoscaler/cloudprovider/clusterapi/clusterapi_unstructured.go
…eInfo for architecture-aware autoscale from zero

@aleskandro aleskandro force-pushed the arch-aware-node-info branch from b27f5d6 to c36027b Compare October 15, 2025 17:05
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 15, 2025
@elmiko
Contributor

elmiko commented Oct 15, 2025

thanks for the updates @aleskandro , i think this is looking good.

/lgtm

i'd like to give @jackfrancis a chance to review again, but i have a feeling we can approve this soon.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 15, 2025
// - Labels set in the labels capacity annotation of machine template, machine set, and machine deployment.
// - Values in the status.nodeSystemInfo of MachineTemplates
// - Generic/default labels set in the environment of the cluster autoscaler
labels := cloudprovider.JoinStringMaps(buildGenericLabels(nodeName), nsiLabels, ng.scalableResource.Labels())
Contributor

This is a bit academic, but now that there's a chance we'll be passing in a nil map[string]string to cloudprovider.JoinStringMaps, let's update the existing TestJoinStringMaps UT to validate that use case.

Something like

$ git diff
diff --git a/cluster-autoscaler/cloudprovider/util_test.go b/cluster-autoscaler/cloudprovider/util_test.go
index ba23d6180..ee99f2bcb 100644
--- a/cluster-autoscaler/cloudprovider/util_test.go
+++ b/cluster-autoscaler/cloudprovider/util_test.go
@@ -44,9 +44,11 @@ func TestBuildKubeProxy(t *testing.T) {
 }
 
 func TestJoinStringMaps(t *testing.T) {
+       emptyMapBeginning := make(map[string]string)
        map1 := map[string]string{"1": "a", "2": "b"}
        map2 := map[string]string{"3": "c", "2": "d"}
        map3 := map[string]string{"5": "e"}
-       result := JoinStringMaps(map1, map2, map3)
+       emptyMapEnd := make(map[string]string)
+       result := JoinStringMaps(emptyMapBeginning, map1, map2, map3, emptyMapEnd)
        assert.Equal(t, map[string]string{"1": "a", "2": "d", "3": "c", "5": "e"}, result)
 }

Labels

area/cluster-autoscaler area/provider/cluster-api Issues or PRs related to Cluster API provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Development

Successfully merging this pull request may close these issues.

Update cluster-api implementation to use MachineTemplate.status.nodeInfo for inferring architecture information for scale from zero

4 participants