
Commit c5c1127

Merge pull request #1601 from tkatila/levelzero-hierarchy
Change GPU plugin's behavior as Level Zero's default hierarchy mode changed from composite to flat
2 parents de8f196 + 95b7230 commit c5c1127

File tree

14 files changed: +501 -262 lines

cmd/gpu_plugin/README.md

Lines changed: 44 additions & 190 deletions
Large diffs are not rendered by default.

cmd/gpu_plugin/advanced-install.md

Lines changed: 24 additions & 0 deletions

# Alternative installation methods for Intel GPU plugin

## Install to all nodes

In case the target cluster does not have NFD (or you don't want to install it), the Intel GPU plugin can be installed to all nodes. This installation method will consume a small amount of unnecessary CPU resources on nodes without Intel GPUs.

```bash
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin?ref=<RELEASE_VERSION>'
```

## Install to nodes via NFD, with Monitoring and Shared-dev

The Intel GPU plugin is installed via NFD's labels and node selector, and the plugin is configured with monitoring and shared devices enabled. This option is useful when there is a desire to retrieve GPU metrics from nodes, for example with [XPU-Manager](https://github.com/intel/xpumanager/) or [collectd](https://github.com/collectd/collectd/tree/collectd-6.0).

```bash
# Start NFD - if your cluster doesn't have NFD installed yet
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd?ref=<RELEASE_VERSION>'

# Create NodeFeatureRules for detecting GPUs on nodes
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd/overlays/node-feature-rules?ref=<RELEASE_VERSION>'

# Create GPU plugin daemonset
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin/overlays/monitoring_shared-dev_nfd/?ref=<RELEASE_VERSION>'
```
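
After applying the overlays, it may be worth checking that NFD labeled the GPU nodes and that the plugin advertises `gpu.intel.com/i915` resources. A minimal sketch, assuming the NodeFeatureRules label GPU nodes with `intel.feature.node.kubernetes.io/gpu=true` (the label and node name below are illustrative, not confirmed by this change):

```bash
# Nodes that NFD detected as having an Intel GPU
$ kubectl get nodes -l intel.feature.node.kubernetes.io/gpu=true

# Check that the plugin registered i915 devices on one of them
$ kubectl describe node <gpu-node-name> | grep gpu.intel.com/i915
```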

cmd/gpu_plugin/driver-firmware.md

Lines changed: 80 additions & 0 deletions

# Driver and firmware for Intel GPUs

Access to a GPU device requires firmware, kernel and user-space
drivers supporting it. The firmware and kernel driver need to be on the
host, the user-space drivers in the GPU workload containers.

Intel GPU devices supported by the current kernel can be listed with:
```
$ grep i915 /sys/class/drm/card?/device/uevent
/sys/class/drm/card0/device/uevent:DRIVER=i915
/sys/class/drm/card1/device/uevent:DRIVER=i915
```

## Drivers for discrete GPUs

> **Note**: Kernel (on host) and user-space drivers (in containers)
> should be installed from the same repository as there are some
> differences between DKMS and upstream GPU driver uAPI.

### Kernel driver

#### Intel DKMS packages

The `i915` GPU driver DKMS[^dkms] package is recommended for Intel
discrete GPUs, until their support in upstream is complete. DKMS
package(s) can be installed from Intel package repositories for a
subset of older kernel versions used in enterprise / LTS
distributions:
https://dgpu-docs.intel.com/installation-guides/index.html

[^dkms]: [intel-gpu-i915-backports](https://github.com/intel-gpu/intel-gpu-i915-backports).

#### Upstream kernel

Support for the first Intel discrete GPUs was added to the upstream Linux
kernel in v6.2, and expanded in later versions. For now, the upstream kernel
is still missing support for a few of the features available in DKMS kernels,
listed here:
https://dgpu-docs.intel.com/driver/kernel-driver-types.html

### GPU Version

PCI IDs for the Intel GPUs on a given host can be listed with:
```
$ lspci | grep -e VGA -e Display | grep Intel
88:00.0 Display controller: Intel Corporation Device 56c1 (rev 05)
8d:00.0 Display controller: Intel Corporation Device 56c1 (rev 05)
```

(`lspci` lists GPUs with display support as "VGA compatible controller",
and server GPUs without display support as "Display controller".)

A mapping between GPU PCI IDs and their Intel brand names is available here:
https://dgpu-docs.intel.com/devices/hardware-table.html

#### GPU Firmware

If your kernel build does not find the correct firmware version for
a given GPU from the host (see `dmesg | grep i915` output), the latest
firmware versions are available upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
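
A quick way to check whether the driver found its firmware is to look at the kernel log and at the firmware files available on the host; a rough sketch (the exact messages and file names vary by GPU generation and distribution):

```
# Kernel log entries about i915 firmware (GuC/HuC/DMC) loading
$ dmesg | grep -i i915 | grep -i -e firmware -e guc -e huc -e dmc

# Firmware blobs available to the kernel on this host
$ ls /lib/firmware/i915/
```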

### User-space drivers

Until new enough user-space drivers (supporting also discrete GPUs)
are available directly from distribution package repositories, they
can be installed to containers from Intel package repositories. See:
https://dgpu-docs.intel.com/installation-guides/index.html

An example container is listed in [Testing and demos](#testing-and-demos).

Validation status against the *upstream* kernel is listed in the user-space drivers' release notes:
* Media driver: https://github.com/intel/media-driver/releases
* Compute driver: https://github.com/intel/compute-runtime/releases

## Drivers for older (integrated) GPUs

For the older (integrated) GPUs, new enough firmware and kernel driver
are typically included already with the host OS, and new enough
user-space drivers (for the GPU containers) are in the host OS
repositories.

cmd/gpu_plugin/fractional.md

Lines changed: 64 additions & 0 deletions

# GPU plugin with GPU Aware Scheduling

This is an experimental feature.

Installing the GPU plugin with [GPU Aware Scheduling](https://github.com/intel/platform-aware-scheduling/tree/master/gpu-aware-scheduling) (GAS) enables containers to request partial (fractional) GPU resources. For example, a Pod's container can request a GPU's millicores or memory and use only a fraction of the GPU. The remaining resources can be leveraged by another container.

> *NOTE*: For this use case to work properly, all GPUs in a given node should provide an equal amount of resources,
> i.e. heterogeneous GPU nodes are not supported.

> *NOTE*: Resource values are used only for scheduling workloads to nodes, not for limiting their GPU usage on the nodes. A container requesting 50% of the GPU's resources is not restricted by the kernel driver or firmware from using more than 50% of the resources. A container requesting 1% of the GPU could use 100% of it.

## Install GPU Aware Scheduling

GAS installation is described in its [README](https://github.com/intel/platform-aware-scheduling/tree/master/gpu-aware-scheduling#usage-with-nfd-and-the-gpu-plugin).

## Install GPU plugin with fractional resources

### With YAML deployments

The GPU plugin DaemonSet needs additional RBAC permissions and access to the kubelet podresources
gRPC service to function. All the required changes are gathered in the `fractional_resources`
overlay. Install the GPU plugin by running:

```bash
# Start NFD - if your cluster doesn't have NFD installed yet
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd?ref=<RELEASE_VERSION>'

# Create NodeFeatureRules for detecting GPUs on nodes
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd/overlays/node-feature-rules?ref=<RELEASE_VERSION>'

# Create GPU plugin daemonset
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin/overlays/fractional_resources?ref=<RELEASE_VERSION>'
```

### With Device Plugin Operator

Install the Device Plugin Operator according to the [install](../operator/README.md#installation) instructions. When applying the [GPU plugin Custom Resource](../../deployments/operator/samples/deviceplugin_v1_gpudeviceplugin.yaml) (CR), set the `resourceManager` option to `true`. The Operator will install all the required RBAC objects and service accounts.

```
spec:
  resourceManager: true
```
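
With either installation method, the nodes should then advertise `gpu.intel.com/` extended resources for GAS to schedule against. A quick, illustrative check (the exact resource names depend on configuration and hardware, and the node name is a placeholder):

```bash
# Extended resources advertised on a GPU node
$ kubectl describe node <gpu-node-name> | grep -A5 'gpu.intel.com'
```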

## Details about fractional resources

Use of fractional GPU resources requires that the cluster has node extended resources with the name prefix `gpu.intel.com/`. Those are created automatically by the GPU plugin with the help of NFD. When fractional resources are enabled, the plugin lets GAS make the card selection decisions based on resource availability and the amount of extended resources requested in the [pod spec](https://github.com/intel/platform-aware-scheduling/blob/master/gpu-aware-scheduling/docs/usage.md#pods).
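
For illustration, the sketch below submits a Pod that requests one GPU device plus half of its millicores. The `gpu.intel.com/millicores` resource name and the values follow the GAS usage documentation linked above, and the image name is a placeholder, so treat this as an assumed example rather than a verified manifest:

```bash
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-fraction-demo
spec:
  restartPolicy: Never
  containers:
  - name: workload
    image: <YOUR_GPU_WORKLOAD_IMAGE>   # placeholder image
    resources:
      limits:
        gpu.intel.com/i915: 1
        gpu.intel.com/millicores: 500
EOF
```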

GAS then annotates the pod objects with unique, increasing numeric timestamps in the `gas-ts` annotation and with container card selections in the `gas-container-cards` annotation. The latter uses '`|`' as the container separator and '`,`' as the card separator. Example for a pod with two containers, both getting two cards: `gas-container-cards:card0,card1|card2,card3`.

Enabling the fractional resource support in the plugin without running GAS in the cluster will only slow down GPU deployments, so do not enable this feature unnecessarily.

## Tile level access and Level Zero workloads

The Level Zero library supports targeting different tiles on a GPU. If the host is equipped with multi-tile GPU devices, and the container requests both `gpu.intel.com/i915` and `gpu.intel.com/tiles` resources, the GPU plugin (with GAS) adds an [affinity mask](https://spec.oneapi.io/level-zero/latest/core/PROG.html#affinity-mask) to the container. By default the mask is in the "FLAT" [device hierarchy](https://spec.oneapi.io/level-zero/latest/core/PROG.html#device-hierarchy) format. With the affinity mask, two Level Zero workloads can share a two-tile GPU so that each workload uses one tile.

If a multi-tile workload is intended to work in "COMPOSITE" hierarchy mode, the container spec environment should include the hierarchy mode variable (`ZE_FLAT_DEVICE_HIERARCHY`) with the value "COMPOSITE". The GPU plugin will then adapt the affinity mask from the default "FLAT" to the "COMPOSITE" format.
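
Extending the example above, the container spec would then carry both the hierarchy variable and a tile request; an illustrative fragment, not a complete manifest:

```
  containers:
  - name: workload
    image: <YOUR_GPU_WORKLOAD_IMAGE>
    env:
    - name: ZE_FLAT_DEVICE_HIERARCHY
      value: "COMPOSITE"
    resources:
      limits:
        gpu.intel.com/i915: 1
        gpu.intel.com/tiles: 2
```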

If the GPU is a single-tile device, the GPU plugin does not set the affinity mask. Only exposing the GPU devices is enough in that case.

### Details about tile resources

GAS makes the GPU and tile selection based on the Pod's resource specification. The selection is passed to the GPU plugin via the Pod's annotation.

Tiles targeted for containers are specified to the Pod via the `gas-container-tiles` annotation, where the annotation value describes a set of card and tile combinations. For example, in a two-container pod, the annotation could be `gas-container-tiles:card0:gt0+gt1|card1:gt1,card2:gt0`. Similarly to `gas-container-cards`, the container details are split via `|`. In the example above, the first container gets tiles 0 and 1 from card 0, and the second container gets tile 1 from card 1 and tile 0 from card 2.

cmd/gpu_plugin/gpu_plugin.go

Lines changed: 28 additions & 14 deletions

```diff
@@ -403,6 +403,29 @@ func (dp *devicePlugin) isCompatibleDevice(name string) bool {
 	return true
 }
 
+func (dp *devicePlugin) devSpecForDrmFile(drmFile string) (devSpec pluginapi.DeviceSpec, devPath string, err error) {
+	if dp.controlDeviceReg.MatchString(drmFile) {
+		//Skipping possible drm control node
+		err = os.ErrInvalid
+
+		return
+	}
+
+	devPath = path.Join(dp.devfsDir, drmFile)
+	if _, err = os.Stat(devPath); err != nil {
+		return
+	}
+
+	// even querying metrics requires device to be writable
+	devSpec = pluginapi.DeviceSpec{
+		HostPath:      devPath,
+		ContainerPath: devPath,
+		Permissions:   "rw",
+	}
+
+	return
+}
+
 func (dp *devicePlugin) scan() (dpapi.DeviceTree, error) {
 	files, err := os.ReadDir(dp.sysfsDir)
 	if err != nil {
@@ -413,6 +436,7 @@ func (dp *devicePlugin) scan() (dpapi.DeviceTree, error) {
 
 	devTree := dpapi.NewDeviceTree()
 	rmDevInfos := rm.NewDeviceInfoMap()
+	tileCounts := []uint64{}
 
 	for _, f := range files {
 		var nodes []pluginapi.DeviceSpec
@@ -429,25 +453,14 @@ func (dp *devicePlugin) scan() (dpapi.DeviceTree, error) {
 		}
 
 		isPFwithVFs := pluginutils.IsSriovPFwithVFs(path.Join(dp.sysfsDir, f.Name()))
+		tileCounts = append(tileCounts, labeler.GetTileCount(dp.sysfsDir, f.Name()))
 
 		for _, drmFile := range drmFiles {
-			if dp.controlDeviceReg.MatchString(drmFile.Name()) {
-				//Skipping possible drm control node
+			devSpec, devPath, devSpecErr := dp.devSpecForDrmFile(drmFile.Name())
+			if devSpecErr != nil {
 				continue
 			}
 
-			devPath := path.Join(dp.devfsDir, drmFile.Name())
-			if _, err := os.Stat(devPath); err != nil {
-				continue
-			}
-
-			// even querying metrics requires device to be writable
-			devSpec := pluginapi.DeviceSpec{
-				HostPath:      devPath,
-				ContainerPath: devPath,
-				Permissions:   "rw",
-			}
-
 			if !isPFwithVFs {
 				klog.V(4).Infof("Adding %s to GPU %s", devPath, f.Name())
 
@@ -487,6 +500,7 @@ func (dp *devicePlugin) scan() (dpapi.DeviceTree, error) {
 
 	if dp.resMan != nil {
 		dp.resMan.SetDevInfos(rmDevInfos)
+		dp.resMan.SetTileCountPerCard(tileCounts)
 	}
 
 	return devTree, nil
```

cmd/gpu_plugin/gpu_plugin_test.go

Lines changed: 3 additions & 0 deletions

```diff
@@ -61,6 +61,9 @@ func (m *mockResourceManager) GetPreferredFractionalAllocation(*v1beta1.Preferre
 	return &v1beta1.PreferredAllocationResponse{}, &dpapi.UseDefaultMethodError{}
 }
 
+func (m *mockResourceManager) SetTileCountPerCard(counts []uint64) {
+}
+
 func createTestFiles(root string, devfsdirs, sysfsdirs []string, sysfsfiles map[string][]byte) (string, string, error) {
 	sysfs := path.Join(root, "sys")
 	devfs := path.Join(root, "dev")
```

0 commit comments