vulkan: enumerate all non-CPU devices always, instead of only when missing a discrete GPU #15793
Previously, integrated GPUs were only enumerated as a fallback when no discrete GPUs were found. This change always enumerates all non-CPU Vulkan devices, allowing users to see and select from both integrated and discrete GPUs when both are present. Users retain full control via the GGML_VK_VISIBLE_DEVICES environment variable.
The reason is that there are a ton of Vulkan-capable weak iGPUs out there that we don't want to use for performance reasons, because they would be slower than using the CPU. If we enabled iGPUs, it would try to split the model between the iGPU and the dGPU by default, and even put more layers onto the iGPU because it probably has more VRAM available. A solution to this would have to be implemented in the ggml device API, not in the backend itself, but that hasn't been done so far because no other backend has this issue (and we are preventing it on the backend level).
Oh, I see, TY for explaining. That makes sense and definitely wouldn't be good.
I suppose we could add an "iGPU" device type, and ignore it by default in llama.cpp. Would that work?
If we move the "default to all dGPUs, if no dGPUs use iGPUs" logic to GGML, yeah, it would. We should open a discussion about cross-backend device selection logic as well.
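The policy being discussed can be sketched in a few lines. This is a minimal illustration only: the device descriptor, enum, and function names below are hypothetical stand-ins, not the actual ggml device API.

```cpp
#include <string>
#include <vector>

// Hypothetical device descriptor; the real ggml device API looks different.
enum class dev_type { cpu, igpu, dgpu };

struct dev_info {
    std::string name;
    dev_type    type;
};

// Proposed default policy: use all discrete GPUs; fall back to
// integrated GPUs only when no discrete GPU is present.
static std::vector<dev_info> select_default_devices(const std::vector<dev_info> & all) {
    std::vector<dev_info> dgpus, igpus;
    for (const auto & d : all) {
        if (d.type == dev_type::dgpu) dgpus.push_back(d);
        if (d.type == dev_type::igpu) igpus.push_back(d);
    }
    return dgpus.empty() ? igpus : dgpus;
}
```

Moving this selection out of the Vulkan backend would let every backend report all its devices honestly and leave the "which to actually use" decision to a single shared place.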
Actually, then what determines how "weak" the GPU in question is? Is the point just to hide the GPU from layer splitting?
Number of cores, memory bandwidth and device architecture. We don't have an easy way to query that information for most devices.
Then how is "enumerate integrated GPU if there is no dedicated" a suitable heuristic? The integrated one could still be worse than the CPU and you also don't have a dedicated GPU.
Yes, but if the user has no (suitable) GPU available they should just be using the CPU backend anyways, so that part is not really our problem. |
But you are right, the heuristic I implemented in the backend for this is outdated now with the existence of GGML device selection options and should be reconsidered. |
The logic would be in the application side (llama.cpp), not in ggml, but it could be made to work that way. ggml needs to expose the device details, but only the application knows what devices it should use. Btw, I noticed that iGPUs with Vulkan are often faster at prompt processing, but slower at generation. If the weights are stored in a host buffer that the CPU backend can use, then it would be possible to use the CPU backend for generation and Vulkan only for prompt processing.
That is true, but those aren't the devices I worry about. For example there are Intel HD graphics since Intel Skylake that support Vulkan, but will just throw a DEVICE_LOST error when you try to use them with large compute shaders like the matmul one. Also, the new Ryzen desktop CPUs have iGPUs with 2 Compute Units, which is just barely enough to run a desktop. We can try to filter these kinds of devices, but it's not straightforward. Large iGPUs like those used in laptop CPUs at some point cross the threshold to be faster at prompt processing, yeah. The really high-end ones like Strix Halo and Apple M-chips are better than CPU in all areas. I don't think we can leave all of this logic to the application, since it's pretty deep into device details. Same with the decision to use e.g. a 3090 with CUDA or Vulkan backend if both are available. GGML should assist in some ways, maybe with a recommendation.
If we know that some iGPUs are too slow or broken to be usable in practice, then I think the right way to handle this is to blacklist them in the backend. We can still have an environment variable to force using these devices for testing purposes, but if otherwise they serve no practical purpose because the CPU is always faster, there is no reason to expose them.
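A backend-level blacklist with a testing override could look roughly like the sketch below. Everything here is hypothetical: the blacklist entry is illustrative, not a vetted list, and GGML_VK_ALLOW_ALL_DEVICES is an invented variable name, not an existing ggml option; a real implementation would also likely match on vendor/device IDs rather than name substrings.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical list of device-name substrings known to be broken or
// too slow in practice (illustrative entry only, not a vetted list).
static const std::vector<std::string> blacklist = {
    "Intel(R) HD Graphics 5",
};

// Hypothetical override variable for testing; not an existing ggml option.
static bool device_is_usable(const std::string & name) {
    if (std::getenv("GGML_VK_ALLOW_ALL_DEVICES") != nullptr) {
        return true; // force-expose every device for testing purposes
    }
    for (const auto & entry : blacklist) {
        if (name.find(entry) != std::string::npos) {
            return false; // known-broken or uselessly slow device
        }
    }
    return true;
}
```

The advantage of keeping this in the backend is that the list can encode device-specific knowledge (like the DEVICE_LOST failures mentioned above) that an application has no way to know about.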
I think we could add a function to obtain the PCI bus ID of the device, and then it would be up to applications to avoid using the same device from multiple backends at the same time. With CUDA there is a
Vulkan has a device ID: I am not sure this corresponds to a PCI bus ID though.
CUDA returns a string in this format, which I believe is a fairly standard way of identifying PCI devices:
Oh yes. Vulkan has an extension for retrieving that: https://registry.khronos.org/vulkan/specs/latest/man/html/VK_EXT_pci_bus_info.html
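VK_EXT_pci_bus_info exposes the PCI address as four integers (pciDomain, pciBus, pciDevice, pciFunction) in a VkPhysicalDevicePCIBusInfoPropertiesEXT struct, queried by chaining it into VkPhysicalDeviceProperties2 and calling vkGetPhysicalDeviceProperties2. Those four values can be formatted into the conventional "domain:bus:device.function" string that CUDA also reports, so the two backends could identify the same physical device. The sketch below shows only the formatting step (the helper name is made up, and the Vulkan query itself is omitted for brevity):

```cpp
#include <cstdio>
#include <string>

// Format the four integers from VkPhysicalDevicePCIBusInfoPropertiesEXT
// (pciDomain, pciBus, pciDevice, pciFunction) into the conventional
// "domain:bus:device.function" PCI address string.
static std::string format_pci_bus_id(unsigned domain, unsigned bus,
                                     unsigned device, unsigned function) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%04x:%02x:%02x.%x",
                  domain, bus, device, function);
    return buf;
}
```

An application could then compare these strings across backends and drop duplicates before building its device list.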
That is what the Vulkan backend is currently doing, but since it's hard to tell whether a device is worth using, we don't expose iGPUs unless no dGPUs are available. I guess that logic should stay then, but be refined with a new heuristic to let specific iGPUs through.
Yeah, I think we can, using VK_EXT_pci_bus_info.
Okay, this seems beyond what I'm able or willing to implement in this PR right now; it probably should just be split off into separate issues. Please ping me in any you create.
Yeah, but thank you for poking us to look into this again.
I have a laptop with both an NVIDIA RTX 4060 and Intel Arc Integrated GPU. I was trying to benchmark performance of the integrated GPU on inference (with Vulkan), but I noticed it wasn't getting enumerated.
So there was no way for me to pick it with --device.
As a workaround, exporting
GGML_VK_VISIBLE_DEVICES="0,1"
would let me pick the integrated GPU. Is there any reason we only enumerate integrated GPUs when we can't find a discrete one?
I think both should be shown to users by default, and there is no need to check vk_instance.device_indices.empty() before trying to enumerate the other sorts of GPUs the device might have.