Conversation

@jeffbolznv
Collaborator

This lets the copy to the destination device use the host-visible vidmem optimization.

I noticed this while looking at #16767, but the change only helps devices that support host-visible vidmem, which I don't think is what that bug is really about.
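For context, the host-visible vidmem path amounts to writing straight through a persistently mapped pointer into device-local memory instead of going through a staging copy. Below is a minimal C++ sketch of the idea, not the actual ggml code: `vk_buffer_sketch`, `buffer_write_2d`, and `buffer_copy` are simplified stand-ins for the internals of `ggml_vk_buffer_write_2d` and `ggml_vk_buffer_copy` in `ggml-vulkan.cpp`.

```cpp
// Minimal sketch of the idea, not the actual ggml code: a cross-device
// buffer copy is staged through host memory, and the destination-side
// write goes through the same 2D-write helper that already knows how to
// memcpy straight into host-visible device-local (BAR) memory.
#include <cstddef>
#include <cstring>
#include <vector>

// Simplified stand-in for ggml's vk_buffer: real buffers carry a
// VkBuffer/VkDeviceMemory; here we only model the host-visible fast path.
struct vk_buffer_sketch {
    bool  host_visible; // memory type is HOST_VISIBLE | DEVICE_LOCAL
    void* mapped;       // persistently mapped pointer when host_visible
};

// Write `height` rows of `width` bytes (source stride `spitch`) into dst.
// Host-visible vidmem: write directly through the mapped pointer.
static void buffer_write_2d(vk_buffer_sketch & dst, size_t offset,
                            const void * src, size_t spitch,
                            size_t width, size_t height) {
    if (dst.host_visible) {
        for (size_t i = 0; i < height; i++) {
            std::memcpy((char *)dst.mapped + offset + i * width,
                        (const char *)src + i * spitch, width);
        }
        return;
    }
    // Otherwise: copy into a staging buffer and vkCmdCopyBuffer (omitted).
}

// Copy between buffers on different devices, staged through the host.
// Routing the final write through buffer_write_2d is the essence of the
// change: the destination side now hits the host-visible fast path.
static void buffer_copy(vk_buffer_sketch & dst, size_t dst_off,
                        const vk_buffer_sketch & src, size_t src_off,
                        size_t size, std::vector<char> & host_staging) {
    host_staging.resize(size);
    if (src.host_visible) {
        std::memcpy(host_staging.data(), (const char *)src.mapped + src_off, size);
    } // else: GPU readback into host_staging (omitted)
    buffer_write_2d(dst, dst_off, host_staging.data(), size, size, 1);
}

int main() {
    std::vector<char> vidmem(4), payload = {'g','g','m','l'}, staging;
    vk_buffer_sketch dst { true, vidmem.data() };
    vk_buffer_sketch src { true, payload.data() };
    buffer_copy(dst, 0, src, 0, payload.size(), staging);
    return std::memcmp(vidmem.data(), "ggml", 4) != 0; // 0 on success
}
```

The PR is essentially the last line of `buffer_copy` above: routing the destination-side write through the 2D-write helper, so a cross-device copy picks up the same fast path that ordinary host-to-device writes already had.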

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -p 0 -n 128,128,128,128,128 -fa 1 -r 30 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -sm layer
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       197.63 ± 27.82 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        206.10 ± 8.94 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        206.18 ± 8.90 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        206.58 ± 7.41 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        203.64 ± 8.23 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -p 0 -n 128,128,128,128,128 -fa 1 -r 30 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -sm layer
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       207.61 ± 26.07 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        215.62 ± 6.20 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        215.64 ± 5.95 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        215.61 ± 5.82 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        214.82 ± 6.04 |

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner October 27, 2025 02:18
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 27, 2025
@0cc4m 0cc4m merged commit f549b00 into ggml-org:master Oct 29, 2025
72 checks passed
wqerrewetw added a commit to wqerrewetw/llama.cpp that referenced this pull request Oct 29, 2025
* Ci (#11) (#12)

* Fix cl (#7)

* Rename build-amd.yml to build-amd.yml.disabled

* Rename winget.yml to winget.yml.disabled

* Rename server.yml to server.yml.disabled

* Rename build.yml to build.yml.disabled

* Update release.yml

* Rename build-cmake-pkg.yml to build-cmake-pkg.yml.disabled

* Rename build-linux-cross.yml to build-linux-cross.yml.disabled

* Rename build-riscv-native.yml.disabled to build-riscv-native.yml

* Rename docker.yml.disabled to docker.yml

* Rename update-ops-docs.yml to update-ops-docs.yml.disabled

* Remove macOS-arm64 job from release workflow

Removed macOS-arm64 job and its associated steps from the release workflow.

* CUDA: Fix bug in topk-moe for gpt-oss (ggml-org#16821)

* CUDA: Fix bug in topk-moe for gpt-oss

When using ggml_can_fuse_subgraph, the output nodes that were passed in were wrong. This caused `test-backend-ops` to still fuse the nodes (because the nodes are not used elsewhere in the test graph), while fusion did not actually happen in the real gpt-oss graph (see the sketch after this commit message).

* fix for qwen3 too

* change ifndef to ifdef

* vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (ggml-org#16793)

This lets the copy to the destination device use the host-visible
vidmem optimization.

---------

Co-authored-by: Aman Gupta <[email protected]>
Co-authored-by: Jeff Bolz <[email protected]>
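
To make the topk-moe failure mode concrete, here is a hypothetical sketch of a subgraph-fusion check; `can_fuse_subgraph` and the `node` type are illustrative stand-ins and do not reproduce ggml's actual `ggml_can_fuse_subgraph` signature. The point is that the answer depends on which nodes are declared as outputs: in an isolated test graph nothing consumes the subgraph from outside, so even a wrong output list passes, while in the real model graph the true outputs have external consumers and fusion is silently rejected.

```cpp
// Hypothetical sketch of a subgraph-fusion check; illustrative only, this
// is not ggml's actual ggml_can_fuse_subgraph signature. Fusion is legal
// only if no node outside the subgraph reads an interior value: interior
// values vanish after fusion, so only declared outputs may escape.
#include <unordered_set>
#include <vector>

struct node { std::vector<node *> srcs; };

static bool can_fuse_subgraph(const std::vector<node *> & subgraph,
                              const std::vector<node *> & outputs,
                              const std::vector<node *> & whole_graph) {
    std::unordered_set<const node *> inside(subgraph.begin(), subgraph.end());
    std::unordered_set<const node *> out(outputs.begin(), outputs.end());
    for (const node * n : whole_graph) {
        if (inside.count(n)) continue;       // only external consumers matter
        for (const node * s : n->srcs) {
            if (inside.count(s) && !out.count(s)) {
                return false;                // undeclared interior value escapes
            }
        }
    }
    return true;
}

int main() {
    // Toy graph: subgraph {a, b}, b is the true output, consumer reads b.
    node a, b, consumer;
    b.srcs        = { &a };
    consumer.srcs = { &b };
    std::vector<node *> sub   = { &a, &b };
    std::vector<node *> graph = { &a, &b, &consumer };
    bool right = can_fuse_subgraph(sub, { &b }, graph); // true: output declared
    bool wrong = can_fuse_subgraph(sub, { &a }, graph); // false: b escapes undeclared
    return (right && !wrong) ? 0 : 1;
}
```

With the true output list the check passes exactly when fusion is safe; with the wrong list, the test graph and the real graph disagree, which is what the fix addresses by passing the correct output nodes.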