Conversation

@pohly (Contributor) commented Sep 4, 2025

  • One-line PR description: update for 1.35

  • Issue link: DRA: device taints and tolerations #5055

  • Other comments: This now covers publishing purely informational taints ("Effect: None"; see the sketch below), rate-limited eviction, and an improved user experience (DeviceTaintRule status, kubectl describe DeviceTaintRule).

The "informational taint" is a potential alternative to #5469.

The one-of ResourceSlice is relevant for #5234.

/cc @nojnhuh @byako @eero-t @mortent
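
For illustration, a minimal sketch of such a purely informational taint, assuming the existing resource.k8s.io/v1alpha3 types; the "None" effect is what this update proposes and is not part of any released API, and the driver and key names are hypothetical:

package example

import (
	resourceapi "k8s.io/api/resource/v1alpha3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// A taint with the proposed "None" effect: surfaced to users (for example
// via kubectl describe), but with no scheduling or eviction consequences.
var informationalRule = resourceapi.DeviceTaintRule{
	ObjectMeta: metav1.ObjectMeta{Name: "driver-outdated"},
	Spec: resourceapi.DeviceTaintRuleSpec{
		DeviceSelector: &resourceapi.DeviceTaintSelector{
			Driver: ptr.To("gpu.example.com"), // hypothetical driver name
		},
		Taint: resourceapi.DeviceTaint{
			Key:    "gpu.example.com/driver-outdated", // hypothetical key
			Value:  "true",
			Effect: "None", // proposed here; only NoSchedule and NoExecute exist today
		},
	},
}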

@k8s-ci-robot commented:

@pohly: GitHub didn't allow me to request PR reviews from the following users: eero-t.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 4, 2025
@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 4, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Sep 4, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 4, 2025
@pohly pohly force-pushed the dra-device-taints-1.35 branch from 73ef40f to 100e650 on September 4, 2025 09:29
Comment on lines +220 to +222
Once eviction starts, it happens at a low enough rate that admins have a chance
to delete the DeviceTaintRule before all pods are evicted if they made a
mistake after all. This rate is configurable to enable faster eviction, if

A reviewer commented:

Is effect: None + kubectl describe by itself an adequate safeguard against erroneous taints? Or are there different kinds of mistakes that only rate limiting eviction would mitigate? In general I like the idea of effect: None essentially being a dry run instead of trying to determine in real-time whether a taint is working while it is or is not actively evicting pods. Wondering if that's a worthwhile way we could narrow the scope here.

@pohly (Contributor Author) replied:

I don't think the rate limiting would uncover anything that kubectl describe may have missed, except perhaps a race (one pod shown as pending eviction, a large number of additional such pods created, then eviction starting and deleting all of them).

But primarily it is that kubectl describe is optional: some admin might forget to double-check. Then the rate limiting may help as a second line of defense.
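
As a rough sketch of that second line of defense (not code from the KEP; golang.org/x/time/rate stands in for however the controller actually paces deletions, and evictSlowly and its parameters are made up):

package example

import (
	"context"

	"golang.org/x/time/rate"
)

// evictSlowly deletes pods at a bounded rate so that an admin who made a
// mistake still has time to delete the DeviceTaintRule before all matching
// pods are gone. evictionsPerSecond would come from the configurable rate
// discussed in this thread.
func evictSlowly(ctx context.Context, pods []string, evictionsPerSecond float64, evict func(string) error) error {
	limiter := rate.NewLimiter(rate.Limit(evictionsPerSecond), 1)
	for _, pod := range pods {
		// Blocks until the limiter grants a token; returns early if the
		// context is canceled, for example because the rule was deleted.
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		if err := evict(pod); err != nil {
			return err
		}
	}
	return nil
}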

Comment on lines +262 to +263
The semantic of the value associated with a taint key is defined by whoever
publishes taints with that key. DRA drivers should use the driver name as

A reviewer commented:

Is it worth drawing a parallel here to standardized device attributes? At least to call out that taint keys could also become standardized across vendors?

@pohly (Contributor Author) replied:

Yes, that's worth describing.
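
For illustration, hypothetical taint keys showing the parallel to domain-prefixed device attributes:

// A DRA driver uses its driver name as the domain prefix of the key and
// defines the semantics of the value itself (hypothetical name):
const taintKeyDriverSpecific = "gpu.example.com/unhealthy"

// A key under a shared domain could be standardized across vendors,
// analogous to standardized device attributes (assumed domain):
const taintKeyStandardized = "resource.kubernetes.io/unhealthy"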

Comment on lines 609 to 610
// EvictionRate controls how quickly Pods get evicted if that is
// the effect of the taint. If multiple taints cause eviction

A reviewer commented:

Should this first sentence mention the unit here being pods/s? Or rename the field something like EvictionsPerSecond?

@pohly (Contributor Author) replied:

I like EvictionsPerSecond.

@pohly (Contributor Author) commented:

I noticed that it might make sense to also allow a DRA driver to set this because they can set effect: Evict. Therefore I moved this into the DeviceTaint itself, which makes it usable inside a ResourceSlice.
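
A sketch of that placement, with the field renamed per the suggestion above (the Go type of the rate and the behavior when unset are assumptions):

// DeviceTaint with the rate limit embedded, so that both a DeviceTaintRule
// and a driver-published ResourceSlice can carry it.
type DeviceTaint struct {
	Key    string
	Value  string
	Effect DeviceTaintEffect // one of the effects discussed in this KEP update

	// EvictionsPerSecond controls how quickly Pods get evicted if the
	// effect of the taint causes eviction. The type and the default
	// when unset are assumptions in this sketch.
	//
	// +optional
	EvictionsPerSecond *float64
}

type DeviceTaintEffect string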

// The length must be less than or equal to 1024.
//
// +optional
Description *string

A reviewer commented:

I suppose another option could be to include this when applicable inside Data instead of having its own field. Or is the idea that Description would be shown by kubectl describe and Data would not?

@pohly (Contributor Author) replied:

Yes, that's exactly the idea behind it being a first-class field.

This now covers publishing purely informational taints ("Effect: None"),
rate limited eviction and improved user experience (DeviceTaintRule status,
`kubectl describe DeviceTaintRule`).
@pohly pohly force-pushed the dra-device-taints-1.35 branch from b6d3123 to 4282b1f on September 22, 2025 10:55
// some Pods and then gets restarted before updating this field.
//
// +optional
PodsEvicted *int64

@pohly (Contributor Author) commented Sep 22, 2025:

I'm undecided about this field. As it stands, it is a lower bound: "at least this number of pods were evicted, but maybe also more". 99% of the time it'll be accurate, but guaranteeing that would imply evicting a pod and updating this field in a single atomic transaction. If the kube-controller-manager dies and restarts, it cannot figure out whether all evicted pods were accounted for.

When using conditions, determining the old count becomes icky (we would need to parse a string on startup, which may not always work, because IMHO we shouldn't make the string format part of the official API).

A reviewer commented:

Is the idea with the PodsEvicted field that it will monotonically increase forever? Would it make more sense to expose that as a metric tied to the lifetime of a kube-controller-manager instance? Then Prometheus would make it easier to add up the total evictions for a given DeviceTaintRule among all past and present kube-controller-manager pods without Kubernetes having to handle that itself.

I think the API is still a good place to at least keep some sign of "in progress"/"done" but maybe doesn't need to detail the exact progress. For that I think a condition makes sense.
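
A sketch of such a per-instance metric (the metric name is an assumption; as the replies below note, a counter with exactly this help text already exists, and a per-rule label is avoided deliberately):

package example

import "github.com/prometheus/client_golang/prometheus"

// Counter tied to the lifetime of one kube-controller-manager instance.
// Deliberately no per-DeviceTaintRule label: an unbounded label value such
// as the rule UID would make the metric's cardinality grow without limit.
var podDeletionsTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "device_taint_eviction_controller_pod_deletions_total", // assumed name
	Help: "Total number of Pods deleted by DeviceTaintEvictionController since its start.",
})

func init() {
	prometheus.MustRegister(podDeletionsTotal)
}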

@pohly (Contributor Author) replied:

> Is the idea with the PodsEvicted field that it will monotonically increase forever?

Yes, but per DeviceTaintRule. Basically some feedback to the admin that "yes, this did something".

> Would it make more sense to expose that as a metric tied to the lifetime of a kube-controller-manager instance? Then Prometheus would make it easier to add up the total evictions for a given DeviceTaintRule among all past and present kube-controller-manager pods without Kubernetes having to handle that itself.

Such a metric makes sense for "overall work done". But it cannot be per DeviceTaintRule because using e.g. the DeviceTaintRule UID as a dimension would lead to unlimited growth of that metric.

> I think the API is still a good place to at least keep some sign of "in progress"/"done" but maybe doesn't need to detail the exact progress. For that I think a condition makes sense.

So you think we can drop PodsEvicted and also not represent it in the condition description?

@pohly (Contributor Author) added:

> Such a metric makes sense for "overall work done".

Actually, it already exists: "Total number of Pods deleted by DeviceTaintEvictionController since its start."

I think as part of the condition description it can be included like this: "10 pods are pending eviction, 100 evicted already since starting the controller". That makes it clear that the count resets when restarting the controller and thus avoids potential confusion when the number is too low.

}

// DeviceTaintRuleStatus provides information about an on-going pod eviction.
type DeviceTaintRuleStatus struct {

@pohly (Contributor Author) commented Sep 22, 2025:

While implementing the type updates I was reminded that bumping the spec increments the Generation. This was meant as preparation for conditions, which can record the observed generation to document what they are based on.

But now these new status fields aren't using conditions. Should they? There's no field to store data like "pods pending eviction" in machine-readable form in https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Condition. The numbers could be part of the message field:

metav1.Condition{
	Type:               "InProgress",
	Status:             metav1.ConditionTrue,
	ObservedGeneration: rule.Generation, // metadata.Generation, incremented by spec updates
	LastTransitionTime: metav1.Now(),
	Reason:             "PodsPendingEviction",
	Message:            fmt.Sprintf("%d pods are pending eviction, %d evicted already", numPodsPending, numPodsEvicted),
}

This becomes `Status: metav1.ConditionFalse` when there are no pods that need to be evicted.

@pohly (Contributor Author) added:

The advantage of conditions is that `kubectl wait --for=condition=InProgress=false DeviceTaintRule/my-taint-rule` can be used, which is nicer than `kubectl wait --for=jsonpath='{.status.podsPendingEviction}'=0`.
