vulkan: Avoid using too much host-visible vidmem, which can lead to fragmentation #11520
Conversation
We should probably modify
I haven't seen this happen. This kinda sounds like bad behaviour by the driver to me. If I request deviceLocal memory, it shouldn't silently fall back to device-visible system memory. If there is not enough (contiguous) space, it should throw an error and allow me to choose a different configuration. I think that is how it works on other drivers. Does this still happen if you disable the Nvidia driver shared memory fallback?
This fallback doesn't happen in the driver. On Windows, the OS has a video memory manager that pages allocations around as needed. The usermode driver prevents allocating more than the total size of vidmem (unless you use the overallocation extension), and in this case the total memory usage is only about 3/4 of the total available vidmem. But when the video memory manager tries to fit the allocations in vidmem (which happens well after allocation), it can't, due to some allocations (like scanout surfaces) that can't be moved. So it ends up putting the allocation in sysmem instead.
If you mean the behavior where the OS is allowed to place vidmem allocations in sysmem if needed, then it's not viable to disable that; applications would just fail under memory stress.
I'll try this. It may still make sense to limit the amount of HVV that gets allocated, but I've tested, for example, that using 2GB allocations rather than 4GB allocations restores the performance.
I meant the sysmem fallback policy, but it seems that's CUDA-only.
I've lost my repro (probably due to rebooting). Coding up the change to allow allocating more than the size the backend reports looks pretty simple. Maybe I should just finish that off and ditch the rest of the change?
We had a buffer for each tensor in the beginning, but I think large allocations are recommended by Vulkan performance guides. I also saw some benchmarks where it makes a difference. If your driver problem can also be handled by reducing the buffer size limit with
With the change @slaren suggested, the vulkan backend can return a value that effectively is a suballocator block size. So it doesn't need to go all the way down to individual allocations. |
True, but you would still have to decide when to show a max alloc size below the actual one. |
Closing this in favor of #11551. |
I recently got a new test system that supports resizable BAR, and found that models that mostly fill vidmem have poor performance, e.g. a 9GB model on a 12GB GPU.
What's happening is we're allocating two 4GB host-visible vidmem buffers, which due to OS limitations have to be contiguous, and the OS isn't able to fit one of them in vidmem due to fragmentation and it ends up in sysmem. This doesn't affect non-host-visible vidmem because the driver can split it into smaller pieces.
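To illustrate why splitting matters: each smaller piece only needs its own contiguous region, so the OS memory manager can place them independently and fragmentation hurts far less. This is a hypothetical sketch of that splitting logic (the function name and structure are illustrative, not the driver's or llama.cpp's actual code):

```cpp
#include <cstdint>
#include <vector>

// Split one large memory request into chunk_size-sized pieces.
// Each piece only requires a contiguous region of at most chunk_size,
// unlike a single allocation, which needs the full size contiguous.
std::vector<uint64_t> split_allocation(uint64_t total, uint64_t chunk_size) {
    std::vector<uint64_t> chunks;
    while (total > 0) {
        uint64_t c = total < chunk_size ? total : chunk_size;
        chunks.push_back(c);
        total -= c;
    }
    return chunks;
}
```

For example, a 9GB request split with a 2GB chunk size yields four 2GB pieces and one 1GB piece, none of which needs more than 2GB of contiguous address space.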
This PR avoids this by choosing not to use host-visible vidmem once we get "mostly full" (with a somewhat arbitrary heuristic). But there's no guarantee this works in all cases, and I'm not hung up on this particular solution. Some other options might be: stop using HVV entirely (I think CUDA doesn't use it today) or don't always use maximum-sized allocations (probably requires ggml changes).
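The kind of heuristic described could look something like the sketch below: stop handing out HVV once an allocation would push heap usage past a threshold fraction, so large contiguous HVV allocations don't spill to sysmem. The function name and the 0.75 threshold are illustrative assumptions, not the PR's actual code or cutoff:

```cpp
#include <cstdint>

// Decide whether a new allocation should come from host-visible vidmem
// (HVV). If granting it would push total heap usage past a threshold
// fraction, fall back to plain device-local memory instead, leaving
// headroom so the OS memory manager can still fit everything in vidmem.
bool should_use_hvv(uint64_t heap_size, uint64_t already_allocated,
                    uint64_t request_size, double threshold = 0.75) {
    double projected = (double)(already_allocated + request_size);
    return projected <= threshold * (double)heap_size;
}
```

With a 12GB heap, a 4GB request on top of 4GB already allocated would stay in HVV (8/12 of the heap), while the same request on top of 8GB already allocated would fall back to non-host-visible vidmem.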
I'm not entirely sure what configurations are affected by this fragmentation issue. It at least affects NVIDIA on Windows, and I'd guess other dGPUs on Windows. I'm not sure about Linux. I suspect it doesn't affect UMA Windows systems.