Conversation

@jeffbolznv (Collaborator)

I recently got a new test system that supports resizable BAR, and found that models that mostly fill vidmem have poor performance, e.g. this 9GB model on a 12GB GPU:

using host-visible vidmem:
llama-bench.exe --model starcoder2-15b-Q4_0.gguf -ngl 1000
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     | 1000 |         pp512 |       306.18 ± 29.64 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     | 1000 |         tg128 |          1.92 ± 0.01 |

after this fix:
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     | 1000 |         pp512 |      1539.45 ± 17.87 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     | 1000 |         tg128 |         39.14 ± 1.90 |

What's happening is we're allocating two 4GB host-visible vidmem buffers, which due to OS limitations have to be contiguous, and the OS isn't able to fit one of them in vidmem due to fragmentation and it ends up in sysmem. This doesn't affect non-host-visible vidmem because the driver can split it into smaller pieces.

This PR avoids the problem by choosing not to use host-visible vidmem once we get "mostly full" (with a somewhat arbitrary heuristic). But there's no guarantee this works in all cases, and I'm not hung up on this particular solution. Some other options might be: stop using HVV entirely (I think CUDA doesn't use it today), or don't always use maximum-sized allocations (which probably requires ggml changes).
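For illustration, here is a minimal sketch of what such a "mostly full" heuristic could look like; the parameter names and the 3/4 threshold are placeholders, not the exact logic in this PR:

```cpp
#include <cstddef>

// Hypothetical heuristic: stop handing out host-visible vidmem (BAR) once the
// device heap is mostly full, since a large contiguous HVV allocation is then
// likely to be paged out to sysmem by the OS video memory manager. The 3/4
// threshold and parameter names are illustrative only.
static bool use_host_visible_vidmem(std::size_t heap_size,
                                    std::size_t bytes_already_allocated,
                                    std::size_t request_size) {
    // Prefer plain device-local memory (which the driver can split into
    // smaller pieces) once the request would push usage past ~75% of the heap.
    return bytes_already_allocated + request_size < heap_size / 4 * 3;
}
```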

I'm not entirely sure what configurations are affected by this fragmentation issue. It at least affects NVIDIA on Windows, and I'd guess other dGPUs on Windows. I'm not sure about Linux. I suspect it doesn't affect UMA Windows systems.

jeffbolznv requested a review from 0cc4m on January 30, 2025 at 17:18
@slaren (Member) commented Jan 30, 2025

> or don't always use maximum-sized allocations (which probably requires ggml changes)

We should probably modify ggml_backend_alloc_ctx_tensors_from_buft so that it tries to allocate a buffer even when a tensor is bigger than the maximum buffer size (currently it just fails). That way you could define a very small max buffer size, and it would allocate one tensor per buffer, which is probably fine in most cases and avoids issues with backends that may have trouble allocating a large amount of contiguous memory.
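As a rough illustration of the idea (not the actual patch), the range-building loop could keep grouping tensors up to the reported max size, but let an oversized tensor through as a one-tensor range instead of failing:

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Sketch only: group tensors into buffers of at most max_size, but let a
// tensor that is by itself larger than max_size become a range of its own so
// the backend still gets a chance to allocate it. alloc_range() stands in for
// the existing buffer-allocation step and is not a real ggml function.
static void alloc_ranges_allow_oversized(struct ggml_context * ctx,
                                         ggml_backend_buffer_type_t buft) {
    const size_t max_size  = ggml_backend_buft_get_max_size(buft);
    const size_t alignment = ggml_backend_buft_get_alignment(buft);

    size_t cur_size = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        const size_t t_size = GGML_PAD(ggml_backend_buft_get_alloc_size(buft, t), alignment);
        if (cur_size > 0 && cur_size + t_size > max_size) {
            // close the current range; an oversized tensor just becomes a
            // range of one instead of an immediate failure
            // alloc_range(ctx, buft, cur_size);
            cur_size = 0;
        }
        cur_size += t_size;
    }
    // alloc_range(ctx, buft, cur_size);  // final range
}
```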

@0cc4m (Collaborator) commented Jan 30, 2025

> What's happening is we're allocating two 4GB host-visible vidmem buffers, which due to OS limitations have to be contiguous, and the OS isn't able to fit one of them in vidmem due to fragmentation and it ends up in sysmem. This doesn't affect non-host-visible vidmem because the driver can split it into smaller pieces.

I haven't seen this happen. This kinda sounds like bad behaviour by the driver to me. If I request deviceLocal memory, it shouldn't silently fall back to device-visible system memory. If there is not enough (contiguous) space, it should throw an error and allow me to choose a different configuration. I think that is how it works on other drivers.

Does this still happen if you disable the Nvidia driver shared memory fallback?

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 30, 2025
@jeffbolznv (Collaborator, Author)

This fallback doesn't happen in the driver. On Windows, the OS has a video memory manager that pages allocations around as needed. The usermode driver prevents allocating more than the total size of vidmem (unless you use the overallocation extension), and in this case the total memory usage is only about 3/4 of the total available vidmem. But when the video memory manager tries to fit the allocations in vidmem (which happens well after allocation), it can't, due to some allocations (like scanout surfaces) that can't be moved. So it ends up putting the allocation in sysmem instead.

> Does this still happen if you disable the Nvidia driver shared memory fallback?

If you mean the behavior where the OS is allowed to place vidmem allocations in sysmem if needed, then it's not viable to disable that; applications would just fail under memory stress.

> We should probably modify ggml_backend_alloc_ctx_tensors_from_buft so that it tries to allocate a buffer even when a tensor is bigger than the maximum buffer size (currently it just fails). That way you could define a very small max buffer size, and it would allocate one tensor per buffer

I'll try this. It may still make sense to limit the amount of HVV that gets allocated, but I've verified, for example, that using 2GB allocations rather than 4GB allocations restores the performance.

@0cc4m (Collaborator) commented Jan 30, 2025

> Does this still happen if you disable the Nvidia driver shared memory fallback?

> If you mean the behavior where the OS is allowed to place vidmem allocations in sysmem if needed, then it's not viable to disable that; applications would just fail under memory stress.

I meant the sysmem fallback policy, but it seems that's CUDA-only.

@jeffbolznv (Collaborator, Author)

I've lost my repro (probably due to rebooting). Coding up the change to allow allocating more than the maximum size the backend reports looks pretty simple. Maybe I should just finish that off and ditch the rest of the change?

@0cc4m (Collaborator) commented Jan 30, 2025

> I've lost my repro (probably due to rebooting). Coding up the change to allow allocating more than the maximum size the backend reports looks pretty simple. Maybe I should just finish that off and ditch the rest of the change?

We had a buffer for each tensor in the beginning, but I think large allocations are recommended by Vulkan performance guides. I also saw some benchmarks where it makes a difference. If your driver problem can also be handled by reducing the buffer size limit with GGML_VK_FORCE_MAX_ALLOCATION_SIZE, maybe that's enough?
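For reference, GGML_VK_FORCE_MAX_ALLOCATION_SIZE is read from the environment; a simplified sketch of how such an override can clamp the reported maximum (the parsing below is illustrative, not the actual ggml-vulkan code):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Illustrative only: clamp the device's reported maximum allocation size with
// an environment override such as GGML_VK_FORCE_MAX_ALLOCATION_SIZE. The real
// ggml-vulkan parsing may differ.
static uint64_t effective_max_allocation_size(uint64_t device_max) {
    if (const char * s = std::getenv("GGML_VK_FORCE_MAX_ALLOCATION_SIZE")) {
        const uint64_t forced = std::strtoull(s, nullptr, 10);
        if (forced > 0) {
            return std::min(device_max, forced);
        }
    }
    return device_max;
}
```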

@jeffbolznv (Collaborator, Author)

With the change @slaren suggested, the Vulkan backend can return a value that effectively acts as a suballocator block size. So it doesn't need to go all the way down to individual allocations.
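In other words (illustrative numbers only, not a value chosen by this PR), the buffer type could report a block size smaller than the device's true 4GB limit, and ggml would then split the weights across several buffers of that size:

```cpp
#include <cstddef>

// Illustrative only: report a "block size" smaller than the device's true
// maximum allocation, so ggml splits model weights across several buffers of
// that size. The 1 GiB figure is an example.
static std::size_t vk_suballocator_block_size_example(std::size_t device_max_alloc) {
    const std::size_t block = 1024ull * 1024 * 1024; // 1 GiB
    return device_max_alloc < block ? device_max_alloc : block;
}
```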

@0cc4m (Collaborator) commented Jan 30, 2025

> With the change @slaren suggested, the Vulkan backend can return a value that effectively acts as a suballocator block size. So it doesn't need to go all the way down to individual allocations.

True, but you would still have to decide when to report a max alloc size below the actual one.

@jeffbolznv (Collaborator, Author)

Closing this in favor of #11551.
