Conversation

@jeffbolznv (Collaborator)

I recently got a new test system that supports resizable BAR, and found that models that mostly fill vidmem have poor performance, e.g. this 9GB model on a 12GB GPU:

using host-visible vidmem:
llama-bench.exe --model starcoder2-15b-Q4_0.gguf -ngl 1000
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     | 1000 |         pp512 |       306.18 ± 29.64 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     | 1000 |         tg128 |          1.92 ± 0.01 |

after this fix:
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     | 1000 |         pp512 |      1539.45 ± 17.87 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     | 1000 |         tg128 |         39.14 ± 1.90 |

What's happening is we're allocating two 4GB host-visible vidmem buffers, which due to OS limitations have to be contiguous, and the OS isn't able to fit one of them in vidmem due to fragmentation and it ends up in sysmem. This doesn't affect non-host-visible vidmem because the driver can split it into smaller pieces.

This PR avoids the problem by choosing not to use host-visible vidmem once we get "mostly full" (with a somewhat arbitrary heuristic). But there's no guarantee this works in all cases, and I'm not hung up on this particular solution. Some other options might be: stop using HVV entirely (I think CUDA doesn't use it today), or don't always use maximum-sized allocations (which probably requires ggml changes).
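For illustration, here is a minimal sketch of what such a "mostly full" heuristic could look like; the parameter names and the 3/4 threshold are placeholders, not the exact logic in this PR:

```cpp
#include <cstddef>

// Hypothetical heuristic: stop handing out host-visible vidmem (BAR) once the
// device heap is mostly full, since a large contiguous HVV allocation is then
// likely to be paged out to sysmem by the OS video memory manager. The 3/4
// threshold and parameter names are illustrative only.
static bool use_host_visible_vidmem(std::size_t heap_size,
                                    std::size_t bytes_already_allocated,
                                    std::size_t request_size) {
    // Prefer plain device-local memory (which the driver can split into
    // smaller pieces) once the request would push usage past ~75% of the heap.
    return bytes_already_allocated + request_size < heap_size / 4 * 3;
}
```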

I'm not entirely sure what configurations are affected by this fragmentation issue. It at least affects NVIDIA on Windows, and I'd guess other dGPUs on Windows. I'm not sure about Linux. I suspect it doesn't affect UMA Windows systems.

jeffbolznv requested a review from 0cc4m on January 30, 2025 at 17:18
@slaren (Member) commented Jan 30, 2025

> or don't always use maximum-sized allocations (which probably requires ggml changes)

We should probably modify ggml_backend_alloc_ctx_tensors_from_buft so that it tries to allocate a buffer even when a tensor is bigger than the maximum buffer size (currently it just fails). That way you could define a very small max buffer size, and it would allocate one tensor per buffer, which is probably fine in most cases and avoids issues with backends that may have trouble allocating a large amount of contiguous memory.
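As a rough illustration of the idea (not the actual patch), the range-building loop could keep grouping tensors up to the reported max size, but let an oversized tensor through as a one-tensor range instead of failing:

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Sketch only: group tensors into buffers of at most max_size, but let a
// tensor that is by itself larger than max_size become a range of its own so
// the backend still gets a chance to allocate it. alloc_range() stands in for
// the existing buffer-allocation step and is not a real ggml function.
static void alloc_ranges_allow_oversized(struct ggml_context * ctx,
                                         ggml_backend_buffer_type_t buft) {
    const size_t max_size  = ggml_backend_buft_get_max_size(buft);
    const size_t alignment = ggml_backend_buft_get_alignment(buft);

    size_t cur_size = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        const size_t t_size = GGML_PAD(ggml_backend_buft_get_alloc_size(buft, t), alignment);
        if (cur_size > 0 && cur_size + t_size > max_size) {
            // close the current range; an oversized tensor just becomes a
            // range of one instead of an immediate failure
            // alloc_range(ctx, buft, cur_size);
            cur_size = 0;
        }
        cur_size += t_size;
    }
    // alloc_range(ctx, buft, cur_size);  // final range
}
```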

@0cc4m (Collaborator) commented Jan 30, 2025

> What's happening is we're allocating two 4GB host-visible vidmem buffers, which due to OS limitations have to be contiguous, and the OS isn't able to fit one of them in vidmem due to fragmentation and it ends up in sysmem. This doesn't affect non-host-visible vidmem because the driver can split it into smaller pieces.

I haven't seen this happen. This kinda sounds like bad behaviour by the driver to me. If I request deviceLocal memory, it shouldn't silently fall back to device-visible system memory. If there is not enough (contiguous) space, it should throw an error and allow me to choose a different configuration. I think that is how it works on other drivers.

Does this still happen if you disable the Nvidia driver shared memory fallback?

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 30, 2025
@jeffbolznv (Collaborator, Author)

This fallback doesn't happen in the driver. On Windows, the OS has a video memory manager that pages allocations around as needed. The usermode driver prevents allocating more than the total size of vidmem (unless you use the overallocation extension), and in this case the total memory usage is only about 3/4 of the total available vidmem. But when the video memory manager tries to fit the allocations in vidmem (which happens well after allocation), it can't, due to some allocations (like scanout surfaces) that can't be moved. So it ends up putting the allocation in sysmem instead.

> Does this still happen if you disable the Nvidia driver shared memory fallback?

If you mean the behavior where the OS is allowed to place vidmem allocations in sysmem if needed, then it's not viable to disable that; applications would just fail under memory stress.

> We should probably modify ggml_backend_alloc_ctx_tensors_from_buft so that it tries to allocate a buffer even when a tensor is bigger than the maximum buffer size (currently it just fails). That way you could define a very small max buffer size, and it would allocate one tensor per buffer

I'll try this. It may still make sense to limit the amount of HVV that gets allocated, but I've verified, for example, that using 2GB allocations rather than 4GB allocations restores the performance.

@0cc4m (Collaborator) commented Jan 30, 2025

> Does this still happen if you disable the Nvidia driver shared memory fallback?

> If you mean the behavior where the OS is allowed to place vidmem allocations in sysmem if needed, then it's not viable to disable that; applications would just fail under memory stress.

I meant the sysmem fallback policy, but it seems that's CUDA-only.

@jeffbolznv (Collaborator, Author)

I've lost my repro (probably due to rebooting). Coding up the change to allow allocating more than the maximum size the backend reports looks pretty simple. Maybe I should just finish that off and ditch the rest of the change?

@0cc4m (Collaborator) commented Jan 30, 2025

> I've lost my repro (probably due to rebooting). Coding up the change to allow allocating more than the maximum size the backend reports looks pretty simple. Maybe I should just finish that off and ditch the rest of the change?

We had a buffer for each tensor in the beginning, but I think large allocations are recommended by Vulkan performance guides. I also saw some benchmarks where it makes a difference. If your driver problem can also be handled by reducing the buffer size limit with GGML_VK_FORCE_MAX_ALLOCATION_SIZE, maybe that's enough?
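For reference, GGML_VK_FORCE_MAX_ALLOCATION_SIZE is read from the environment; a simplified sketch of how such an override can clamp the reported maximum (the parsing below is illustrative, not the actual ggml-vulkan code):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Illustrative only: clamp the device's reported maximum allocation size with
// an environment override such as GGML_VK_FORCE_MAX_ALLOCATION_SIZE. The real
// ggml-vulkan parsing may differ.
static uint64_t effective_max_allocation_size(uint64_t device_max) {
    if (const char * s = std::getenv("GGML_VK_FORCE_MAX_ALLOCATION_SIZE")) {
        const uint64_t forced = std::strtoull(s, nullptr, 10);
        if (forced > 0) {
            return std::min(device_max, forced);
        }
    }
    return device_max;
}
```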

@jeffbolznv (Collaborator, Author)

With the change @slaren suggested, the Vulkan backend can return a value that effectively acts as a suballocator block size. So it doesn't need to go all the way down to individual allocations.
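In other words (illustrative numbers only, not a value chosen by this PR), the buffer type could report a block size smaller than the device's true 4GB limit, and ggml would then split the weights across several buffers of that size:

```cpp
#include <cstddef>

// Illustrative only: report a "block size" smaller than the device's true
// maximum allocation, so ggml splits model weights across several buffers of
// that size. The 1 GiB figure is an example.
static std::size_t vk_suballocator_block_size_example(std::size_t device_max_alloc) {
    const std::size_t block = 1024ull * 1024 * 1024; // 1 GiB
    return device_max_alloc < block ? device_max_alloc : block;
}
```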

@0cc4m (Collaborator) commented Jan 30, 2025

> With the change @slaren suggested, the Vulkan backend can return a value that effectively acts as a suballocator block size. So it doesn't need to go all the way down to individual allocations.

True, but you would still have to decide when to report a max alloc size below the actual one.

@jeffbolznv (Collaborator, Author)

Closing this in favor of #11551.
