vulkan: enumerate all non-CPU devices always, instead of only when missing a discrete GPU #15793
Previously, integrated GPUs were only enumerated as a fallback when no discrete GPUs were found. This change always enumerates all non-CPU Vulkan devices, allowing users to see and select from both integrated and discrete GPUs when both are present. Users retain full control via the GGML_VK_VISIBLE_DEVICES environment variable.
The reason is that there are a ton of Vulkan-capable weak iGPUs out there that we don't want to use for performance reasons, because they would be slower than using the CPU. If we enabled iGPUs, it would try to split the model between the iGPU and the dGPU by default, and even put more layers onto the iGPU because it probably has more VRAM available. A solution to this would have to be implemented in the ggml device API, not in the backend itself, but that hasn't been done so far because no other backend has this issue (and we are preventing it on the backend level).
Oh, I see, TY for explaining. That makes sense and definitely wouldn't be good.
I suppose we could add an "iGPU" device type, and ignore it by default in llama.cpp. Would that work?
If we move the "default to all dGPUs, if no dGPUs use iGPUs" logic to GGML, yeah, it would. We should open a discussion about cross-backend device selection logic as well.
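The policy being discussed can be sketched in a few lines. This is a minimal illustration only: the device descriptor, enum, and function names below are hypothetical stand-ins, not the actual ggml device API.

```cpp
#include <string>
#include <vector>

// Hypothetical device descriptor; the real ggml device API looks different.
enum class dev_type { cpu, igpu, dgpu };

struct dev_info {
    std::string name;
    dev_type    type;
};

// Proposed default policy: use all discrete GPUs; fall back to
// integrated GPUs only when no discrete GPU is present.
static std::vector<dev_info> select_default_devices(const std::vector<dev_info> & all) {
    std::vector<dev_info> dgpus, igpus;
    for (const auto & d : all) {
        if (d.type == dev_type::dgpu) dgpus.push_back(d);
        if (d.type == dev_type::igpu) igpus.push_back(d);
    }
    return dgpus.empty() ? igpus : dgpus;
}
```

Moving this selection out of the Vulkan backend would let every backend report all its devices honestly and leave the "which to actually use" decision to a single shared place.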
Actually, then what determines how "weak" the GPU in question is? Is the point just to hide the GPU from layer splitting?
Number of cores, memory bandwidth and device architecture. We don't have an easy way to query that information for most devices.
Then how is "enumerate integrated GPU if there is no dedicated" a suitable heuristic? The integrated one could still be worse than the CPU and you also don't have a dedicated GPU.
Yes, but if the user has no (suitable) GPU available they should just be using the CPU backend anyways, so that part is not really our problem. |
But you are right, the heuristic I implemented in the backend for this is outdated now with the existence of GGML device selection options and should be reconsidered. |
The logic would be in the application side (llama.cpp), not in ggml, but it could be made to work that way. ggml needs to expose the device details, but only the application knows what devices it should use. Btw, I noticed that iGPUs with Vulkan are often faster at prompt processing, but slower at generation. If the weights are stored in a host buffer that the CPU backend can use, then it would be possible to use the CPU backend for generation and Vulkan only for prompt processing.
That is true, but those aren't the devices I worry about. For example there are Intel HD graphics since Intel Skylake that support Vulkan, but will just throw a DEVICE_LOST error when you try to use them with large compute shaders like the matmul one. Also, the new Ryzen desktop CPUs have iGPUs with 2 Compute Units, which is just barely enough to run a desktop. We can try to filter these kinds of devices, but it's not straightforward. Large iGPUs like those used in laptop CPUs at some point cross the threshold to be faster at prompt processing, yeah. The really high-end ones like Strix Halo and Apple M-chips are better than CPU in all areas. I don't think we can leave all of this logic to the application, since it's pretty deep into device details. Same with the decision to use e.g. a 3090 with CUDA or Vulkan backend if both are available. GGML should assist in some ways, maybe with a recommendation.
If we know that some iGPUs are too slow or broken to be usable in practice, then I think the right way to handle this is to blacklist them in the backend. We can still have an environment variable to force using these devices for testing purposes, but if otherwise they serve no practical purpose because the CPU is always faster, there is no reason to expose them.
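A backend-level blacklist with a testing override could look roughly like the sketch below. Everything here is hypothetical: the blacklist entry is illustrative, not a vetted list, and GGML_VK_ALLOW_ALL_DEVICES is an invented variable name, not an existing ggml option; a real implementation would also likely match on vendor/device IDs rather than name substrings.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical list of device-name substrings known to be broken or
// too slow in practice (illustrative entry only, not a vetted list).
static const std::vector<std::string> blacklist = {
    "Intel(R) HD Graphics 5",
};

// Hypothetical override variable for testing; not an existing ggml option.
static bool device_is_usable(const std::string & name) {
    if (std::getenv("GGML_VK_ALLOW_ALL_DEVICES") != nullptr) {
        return true; // force-expose every device for testing purposes
    }
    for (const auto & entry : blacklist) {
        if (name.find(entry) != std::string::npos) {
            return false; // known-broken or uselessly slow device
        }
    }
    return true;
}
```

The advantage of keeping this in the backend is that the list can encode device-specific knowledge (like the DEVICE_LOST failures mentioned above) that an application has no way to know about.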
I think we could add a function to obtain the PCI bus ID of the device, and then it would be up to applications to avoid using the same device from multiple backends at the same time. With CUDA there is a
Vulkan has a device ID: I am not sure this corresponds to a PCI bus ID though.
CUDA returns a string in this format, which I believe is a fairly standard way of identifying PCI devices:
Oh yes. Vulkan has an extension for retrieving that: https://registry.khronos.org/vulkan/specs/latest/man/html/VK_EXT_pci_bus_info.html
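VK_EXT_pci_bus_info exposes the PCI address as four integers (pciDomain, pciBus, pciDevice, pciFunction) in a VkPhysicalDevicePCIBusInfoPropertiesEXT struct, queried by chaining it into VkPhysicalDeviceProperties2 and calling vkGetPhysicalDeviceProperties2. Those four values can be formatted into the conventional "domain:bus:device.function" string that CUDA also reports, so the two backends could identify the same physical device. The sketch below shows only the formatting step (the helper name is made up, and the Vulkan query itself is omitted for brevity):

```cpp
#include <cstdio>
#include <string>

// Format the four integers from VkPhysicalDevicePCIBusInfoPropertiesEXT
// (pciDomain, pciBus, pciDevice, pciFunction) into the conventional
// "domain:bus:device.function" PCI address string.
static std::string format_pci_bus_id(unsigned domain, unsigned bus,
                                     unsigned device, unsigned function) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%04x:%02x:%02x.%x",
                  domain, bus, device, function);
    return buf;
}
```

An application could then compare these strings across backends and drop duplicates before building its device list.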
That is what the Vulkan backend is currently doing, but since it's hard to tell whether a device is worth using, we don't expose iGPUs unless no dGPUs are available. I guess that logic should stay then, but be refined with a new heuristic to let specific iGPUs through.
Yeah, I think we can, using VK_EXT_pci_bus_info.
Okay, this seems beyond what I'm able or willing to implement in this PR right now; it probably should just be split off into separate issues. Please ping me in any you create.
Yeah, but thank you for poking us to look into this again.
I have a laptop with both an NVIDIA RTX 4060 and Intel Arc Integrated GPU. I was trying to benchmark performance of the integrated GPU on inference (with Vulkan), but I noticed it wasn't getting enumerated.
So there was no way for me to pick it with --device.
As a workaround, exporting
GGML_VK_VISIBLE_DEVICES="0,1"
would let me pick the integrated GPU. Is there any reason we only enumerate integrated GPUs when we can't find a discrete one?
I think both should be shown to users by default, and there is no need to check vk_instance.device_indices.empty() before trying to enumerate the other sorts of GPUs the device might have.