KEP 5055: DRA Device Taints: update for 1.35 #5512
base: master
Conversation
@pohly: GitHub didn't allow me to request PR reviews from the following users: eero-t. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pohly. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
Force-pushed from 73ef40f to 100e650.
keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md (outdated review thread, resolved)
> Once eviction starts, it happens at a low enough rate that admins have a chance to delete the DeviceTaintRule before all pods are evicted if they made a mistake after all. This rate is configurable to enable faster eviction, if …
Is `effect: None` + `kubectl describe` by itself an adequate safeguard against erroneous taints? Or are there different kinds of mistakes that only rate-limited eviction would mitigate? In general I like the idea of `effect: None` essentially being a dry run instead of trying to determine in real time whether a taint is working while it is or is not actively evicting pods. Wondering if that's a worthwhile way we could narrow the scope here.
I don't think the rate limiting would uncover anything that `kubectl describe` may have missed, except perhaps a race (one pod shown as to be evicted, a large number of additional such pods created, then eviction turned on and all of them deleted).

But primarily it is that `kubectl describe` is optional; some admin might forget to double-check. Then the rate limiting may help as a second line of defense.
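For illustration, a minimal sketch of the dry-run workflow discussed here, assuming the field names of the current v1alpha3 `DeviceTaintRule` API; `None` is the newly proposed effect, and the driver name, device, key, and value are made up:

```go
package example

import (
	resourcealpha "k8s.io/api/resource/v1alpha3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// dryRunRule publishes the taint without any scheduling or eviction effect.
// An admin can inspect the affected devices and pods with
// `kubectl describe devicetaintrule` and only then switch Effect to
// NoExecute to actually start the (rate-limited) eviction.
func dryRunRule() *resourcealpha.DeviceTaintRule {
	return &resourcealpha.DeviceTaintRule{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-suspected-bad"},
		Spec: resourcealpha.DeviceTaintRuleSpec{
			DeviceSelector: &resourcealpha.DeviceTaintSelector{
				// Hypothetical driver; narrower selection by pool,
				// device name, or CEL selectors is also possible.
				Driver: ptr.To("gpu.example.com"),
			},
			Taint: resourcealpha.DeviceTaint{
				Key:    "gpu.example.com/unhealthy",
				Value:  "ecc-errors",
				Effect: "None", // proposed dry-run effect, not in v1alpha3 yet
			},
		},
	}
}
```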
> The semantic of the value associated with a taint key is defined by whoever publishes taints with that key. DRA drivers should use the driver name as …
Is it worth drawing a parallel here to standardized device attributes? At least to call out that taint keys could also become standardized across vendors?
Yes, that's worth describing.
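A small illustration of that naming parallel; both keys below are made-up examples rather than standardized values:

```go
package example

// Taint keys follow the same domain-prefix convention as device attributes:
// a vendor-specific key is prefixed with the driver name, while a key that
// becomes standardized across vendors would live under a shared domain.
const (
	// Semantics defined solely by the driver that publishes it.
	vendorTaintKey = "gpu.example.com/ecc-errors"

	// Hypothetical standardized key with vendor-independent semantics.
	standardTaintKey = "resource.kubernetes.io/unhealthy"
)
```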
keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md (outdated review thread, resolved)
```go
// EvictionRate controls how quickly Pods get evicted if that is
// the effect of the taint. If multiple taints cause eviction …
```
Should this first sentence mention the unit here being pods/s? Or rename the field to something like `EvictionsPerSecond`?
I like `EvictionsPerSecond`.
I noticed that it might make sense to also allow a DRA driver to set this, because they can set `effect: Evict`. Therefore I moved this into the `DeviceTaint` itself, which makes it usable inside a ResourceSlice.
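A sketch of how the field could look after that move, using the name suggested above; the exact type is an open question (a `resource.Quantity` avoids floats while still allowing fractional rates), so treat this as an assumption rather than the final API:

```go
package resourceapi

import "k8s.io/apimachinery/pkg/api/resource"

type DeviceTaintEffect string

type DeviceTaint struct {
	Key    string
	Value  string
	Effect DeviceTaintEffect

	// EvictionsPerSecond limits how quickly pods are evicted when the
	// effect of this taint causes eviction. Because the field lives in
	// DeviceTaint itself, it can be set both by an admin in a
	// DeviceTaintRule and by a DRA driver in a ResourceSlice. If
	// multiple taints cause eviction, the lowest rate applies.
	//
	// +optional
	EvictionsPerSecond *resource.Quantity
}
```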
```go
// The length must be smaller or equal to 1024.
//
// +optional
Description *string
```
I suppose another option could be to include this when applicable inside `Data` instead of having its own field. Or is the idea that `Description` would be shown by `kubectl describe` and `Data` would not?
Yes, that's exactly the idea behind it being a first-class field.
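To make that split concrete, a sketch of the two fields side by side; only `Description` is quoted from the KEP draft, the placement and the type of `Data` here are assumptions chosen to contrast the two:

```go
package resourceapi

import "k8s.io/apimachinery/pkg/runtime"

// exampleTaintFields is a hypothetical fragment, not a real API type.
type exampleTaintFields struct {
	// Description is free-form text for humans and gets surfaced by
	// `kubectl describe`. The length must be smaller or equal to 1024.
	//
	// +optional
	Description *string

	// Data carries structured, machine-readable payload that is not
	// meant to be rendered by kubectl describe.
	//
	// +optional
	Data *runtime.RawExtension
}
```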
Force-pushed from 100e650 to b6d3123.
keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md (outdated review thread, resolved)
This now covers publishing purely informational taints ("Effect: None"), rate limited eviction and improved user experience (DeviceTaintRule status, `kubectl describe DeviceTaintRule`).
Force-pushed from b6d3123 to 4282b1f.
```go
// some Pods and then gets restarted before updating this field.
//
// +optional
PodsEvicted *int64
```
I'm undecided about this field. As it stands, it is a lower bound: "at least this number of pods were evicted, but maybe also more". 99% of the time it'll be accurate, but guaranteeing that would imply evicting a pod and updating this field in a single atomic transaction. If the kube-controller-manager dies and restarts, it cannot figure out whether all evicted pods were accounted for.

When using conditions, determining the old count becomes icky (we'd need to parse a string on startup, which may not always work, because IMHO we shouldn't make the string format part of the official API).
Is the idea with the `PodsEvicted` field that it will monotonically increase forever? Would it make more sense to expose that as a metric tied to the lifetime of a kube-controller-manager instance? Then Prometheus would make it easier to add up the total evictions for a given DeviceTaintRule among all past and present kube-controller-manager pods without Kubernetes having to handle that itself.

I think the API is still a good place to at least keep some sign of "in progress"/"done", but maybe it doesn't need to detail the exact progress. For that I think a condition makes sense.
> Is the idea with the PodsEvicted field that it will monotonically increase forever?

Yes, but per `DeviceTaintRule`. Basically some feedback to the admin that "yes, this did something".

> Would it make more sense to expose that as a metric tied to the lifetime of a kube-controller-manager instance? Then Prometheus would make it easier to add up the total evictions for a given DeviceTaintRule among all past and present kube-controller-manager pods without Kubernetes having to handle that itself.

Such a metric makes sense for "overall work done". But it cannot be per `DeviceTaintRule`, because using e.g. the `DeviceTaintRule` UID as a dimension would lead to unlimited growth of that metric.

> I think the API is still a good place to at least keep some sign of "in progress"/"done" but maybe doesn't need to detail the exact progress. For that I think a condition makes sense.

So you think we can drop `PodsEvicted` and also not represent it in the condition description?
> Such a metric makes sense for "overall work done".

Actually, it already exists: "Total number of Pods deleted by DeviceTaintEvictionController since its start."

I think as part of the condition description it can be included like this: "10 pods are pending eviction, 100 evicted already since starting the controller". That makes it clear that the count resets when restarting the controller and thus avoids potential confusion when the number is too low.
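A minimal sketch of why the controller-wide counter cannot simply gain a per-rule dimension; this uses client_golang for illustration (kube-controller-manager actually uses the component-base metrics wrappers) and the metric name here is assumed, not the real one:

```go
package eviction

import "github.com/prometheus/client_golang/prometheus"

// Total pods deleted by the device taint eviction controller since it
// started. Deliberately no per-DeviceTaintRule label: a label value such as
// the rule UID would create a new time series for every rule ever processed
// and grow without bound.
var podsEvictedTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "device_taint_eviction_controller_pod_deletions_total", // assumed name
	Help: "Total number of Pods deleted by DeviceTaintEvictionController since its start.",
})

func init() {
	prometheus.MustRegister(podsEvictedTotal)
}

// recordEviction is called once per successfully evicted pod. The per-rule
// "at least N pods evicted" feedback belongs in the DeviceTaintRule
// condition message instead of in a metric dimension.
func recordEviction() {
	podsEvictedTotal.Inc()
}
```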
```go
}

// DeviceTaintRuleStatus provides information about an on-going pod eviction.
type DeviceTaintRuleStatus struct {
```
While implementing the type updates I was reminded that bumping the spec increments the `Generation`. This was meant as preparation for conditions, which can record the observed generation to document what they are based on.

But now these new status fields aren't using conditions. Should they? There's no field to store data like "pods pending eviction" in machine-readable form in https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Condition. The numbers could be part of the message field:

```go
metav1.Condition{
	Type:               "InProgress",
	Status:             metav1.ConditionTrue,
	ObservedGeneration: rule.Generation,
	LastTransitionTime: metav1.Now(),
	Reason:             "PodsPendingEviction",
	Message:            fmt.Sprintf("%d pods are pending eviction, %d evicted already", numPodsPending, numPodsEvicted),
}
```

This becomes `Status: ConditionFalse` when there are no pods which need to be evicted.
The advantage of conditions is that `kubectl wait --for=condition=InProgress=false DeviceTaintRule/my-taint-rule` can be used, which is nicer than `kubectl wait --for=jsonpath='{.status.podsPendingEviction}'=0`.
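For completeness, a minimal sketch (an assumed helper, not the actual controller code) of how the controller could maintain this condition with `meta.SetStatusCondition`, which only bumps `LastTransitionTime` when the status value actually flips:

```go
package eviction

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setInProgress records eviction progress for one DeviceTaintRule in its
// status conditions.
func setInProgress(conditions *[]metav1.Condition, observedGeneration int64, pending, evicted int) {
	status := metav1.ConditionTrue
	reason := "PodsPendingEviction"
	if pending == 0 {
		status = metav1.ConditionFalse
		reason = "NoPodsPendingEviction"
	}
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "InProgress",
		Status:             status,
		ObservedGeneration: observedGeneration,
		Reason:             reason,
		Message: fmt.Sprintf("%d pods are pending eviction, %d evicted already since starting the controller",
			pending, evicted),
	})
}
```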
One-line PR description: update for 1.35
Issue link: DRA: device taints and tolerations #5055
Other comments: This now covers publishing purely informational taints ("Effect: None"), rate limited eviction and improved user experience (DeviceTaintRule status, `kubectl describe DeviceTaintRule`). The "informational taint" is a potential alternative to #5469. The one-of ResourceSlice is relevant for #5234.

/cc @nojnhuh @byako @eero-t @mortent