# KEP-4671 Add docs for Workload API and Gang scheduling #53296

base: dev-1.35
@@ -0,0 +1,145 @@

---
title: Workload Aware Scheduling
content_type: concept
weight: 120
---
This page provides an overview of Workload Aware Scheduling (WAS), a Kubernetes feature
that enables the kube-scheduler to manage groups of related Pods as a single unit.

## What is Workload Aware Scheduling?
The default Kubernetes scheduler makes decisions for one Pod at a time. This model works well for stateless applications,
but can be inefficient for tightly-coupled workloads like those found in machine learning, scientific computing, or big data analytics.
These applications often require that a group of Pods run concurrently to make any progress.

> **Reviewer note (Member):** This isn't exactly true. The default scheduler's behavior, at the time this doc is live, depends on whether you have enabled the feature. v1.35 K8s will, of course, support gang scheduling (as alpha), in-tree.

When the scheduler places these Pods individually, it can lead to resource deadlocks or placement inefficiencies.
For example, half of a job's Pods might be scheduled, consuming cluster resources, while the other half remains pending
because no single node has enough capacity for them. The job cannot run,
but the scheduled Pods waste expensive resources that other applications could use.

Workload Aware Scheduling introduces a mechanism for the scheduler to identify and manage a group of Pods as a single, atomic workload.
This allows for more intelligent placement decisions and is the foundation for features like gang scheduling.

> **Reviewer note (Member):** Aim to write the documentation mostly as if the feature is already generally available, and then garnish it with caveats about it actually being alpha. Good documentation is often timeless.

The Workload API is used to express these group scheduling requirements.
## Workload API

{{< feature-state feature_gate_name="GenericWorkload" >}}

The Workload API resource, available from the `scheduling.k8s.io/v1alpha1` API group, allows you to logically group a set of Pods.
You then link Pods to a Workload to inform the scheduler that they should be considered together.
Controllers for high-level resources can create Workload objects to communicate placement requirements to the scheduler.
For example, a Job is the API object that records desired and observed state, while the Job controller is the control loop
that acts on it; that controller can create a Workload object for the Pods it manages.

A Workload resource defines one or more PodGroups. Each PodGroup specifies a set of Pods and the scheduling policy that applies to them.

Here is an example manifest for a Workload that defines a gang of three Pods:
```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
spec:
  # controllerRef provides a link to the object that manages this Workload,
  # such as a Kubernetes Job. This is for tooling and observability.
  controllerRef:
    apiGroup: batch
    kind: Job
    name: training-job
  podGroups:
  - name: workers
    policy:
      gang:
        # The minimum number of Pods from this group that must be schedulable
        # at the same time for any of them to be scheduled.
        minCount: 3
```
To associate a Pod with this Workload, you add a `spec.workloadRef` field to the Pod's manifest.
This field creates a link to a specific PodGroup within the Workload.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod-1
spec:
  # This reference links the Pod to the 'workers' PodGroup
  # inside the 'training-job-workload' Workload.
  workloadRef:
    name: training-job-workload
    podGroup: workers
  # ...
```
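To confirm the association, you can read the reference back from the API server (a quick check using the example Pod above):

```shell
# Print the Pod's workloadRef; empty output means the Pod is not
# linked to any Workload.
kubectl get pod training-pod-1 -o jsonpath='{.spec.workloadRef}'
```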
### Pod Group Replicas

For more complex scenarios, you can partition a single PodGroup into replicated, independent gangs.
You achieve this using the `podGroupReplicaKey` field within a Pod's `workloadRef`. This key acts as a label
that creates logical subgroups. The scheduling algorithm is then applied to each subgroup separately.

For example, if you have a PodGroup with `minCount: 2` and you create four Pods: two with `podGroupReplicaKey: "0"`
and two with `podGroupReplicaKey: "1"`, the scheduler treats them as two independent gangs of two Pods.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod-1
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    # All workers with the replica key "0" will be scheduled together as one gang.
    podGroupReplicaKey: "0"
  # ...
```
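By contrast, a Pod in the second gang carries a different replica key. A sketch, with a hypothetical Pod name:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod-3
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
    # Replica key "1" puts this Pod in a separate gang that is
    # scheduled independently of the replica key "0" gang.
    podGroupReplicaKey: "1"
  # ...
```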
## Scheduling Policies

> **Reviewer note (Member):** We should update the Policies concept page to hyperlink to wherever we end up putting this section.

For each PodGroup, you must specify exactly one scheduling policy. The two available policies are `basic` and `gang`.
### Basic Scheduling

The `basic` policy instructs the scheduler to treat all Pods in a PodGroup as independent entities,
scheduling them using the standard Kubernetes behavior.

Currently, the main reason to use the `basic` policy is to organize the Pods within your Workload
for better observability and management.

While this policy does not add any special group-level scheduling constraints today,
it provides a foundation for future enhancements. For example, future versions of Kubernetes
might introduce group-level constraints that apply to a PodGroup without requiring
the all-or-nothing semantics of gang scheduling.
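For illustration, a PodGroup that opts out of gang semantics might look like this (a sketch; it assumes the `basic` policy takes no further fields in this alpha API):

```yaml
podGroups:
- name: observers
  policy:
    # basic: schedule these Pods one by one, with no group-level constraints.
    basic: {}
```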
### Gang Scheduling

{{< feature-state feature_gate_name="GangScheduling" >}}

The `gang` policy enforces gang scheduling, which ensures that a group of Pods is scheduled on an "all-or-nothing" basis.
Without gang scheduling, a job might be partially scheduled, leading to wasted resources and potential deadlocks.

The `GangScheduling` plugin uses the Workload API to implement its logic.
When you create Pods that are part of a gang-scheduled PodGroup, the scheduler follows this process,
independently for each PodGroup and its replica key:
1. The scheduler waits at `PreEnqueue` until the number of Pods that have been created for the specific PodGroup
   is at least equal to the `minCount`. Pods do not enter the active scheduling queue until this condition is met.

2. Once the quorum is met, the scheduler attempts to find node placements for all Pods in the group, taking them pod by pod.
   All assigned Pods wait at a `WaitOnPermit` gate during this process. Future versions will introduce a new,
   single-cycle scheduling phase that finds the placement for the entire group at once.

3. If the scheduler finds valid placements for at least `minCount` Pods, it allows all of them to be bound to their assigned nodes.
   If it cannot find placements for the entire group within a fixed 5-minute timeout, none of the Pods are scheduled.
   Instead, they are moved to the unschedulable queue to wait for cluster resources to free up.
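End to end, creating the gang might look like this (a sketch that reuses the `training-job-workload` manifest above; the Pod names and container image are illustrative):

```shell
# Create three Pods in the 'workers' PodGroup. None of them is bound
# to a node until all three exist and can be placed together (minCount: 3).
for i in 1 2 3; do
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: training-pod-$i
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  containers:
  - name: worker
    image: registry.k8s.io/pause:3.9
EOF
done
```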
## Enabling the Features

To use the Workload API in your cluster, you must enable the `GenericWorkload` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
on both the `kube-apiserver` and `kube-scheduler`, and enable the `scheduling.k8s.io/v1alpha1` API group on the `kube-apiserver`.

To use gang scheduling, you must also enable the `GangScheduling` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
on the `kube-scheduler`. When this feature gate is enabled, the `GangScheduling` plugin is enabled by default in the scheduler's profiles.
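If you run the control plane binaries directly, the relevant flags might look like the following (a sketch showing only the flags discussed here; how you set them depends on how your cluster is deployed):

```shell
# kube-apiserver: enable the feature gate and serve the alpha API group
kube-apiserver --feature-gates=GenericWorkload=true \
  --runtime-config=scheduling.k8s.io/v1alpha1=true

# kube-scheduler: enable both gates to allow gang scheduling
kube-scheduler --feature-gates=GenericWorkload=true,GangScheduling=true
```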
@@ -0,0 +1,15 @@

---
title: GangScheduling
content_type: feature_gate
_build:
  list: never
  render: false

stages:
- stage: alpha
  defaultValue: false
  fromVersion: "1.35"
---

Enables the GangScheduling plugin in kube-scheduler, which implements an "all-or-nothing"
scheduling algorithm. The Workload API is used to express the requirements.

> **Reviewer note (Member):** (nit) Workload could be a hyperlink.

@@ -0,0 +1,15 @@
---
title: GenericWorkload
content_type: feature_gate
_build:
  list: never
  render: false

stages:
- stage: alpha
  defaultValue: false
  fromVersion: "1.35"
---

Enables the scheduling.k8s.io/v1alpha1 Workload API to express scheduling requirements
at the workload level. Pods can reference a specific Workload PodGroup using the spec.workloadRef field.

> **Reviewer note (Member):** No it doesn't, surely. You must also enable that API group separately?

> **Reviewer note (Member):** (nit) PodGroup and Workload could be hyperlinks.

> **Reviewer note (Member):** I wouldn't add this file at all.