Conversation

@jeffbolznv
Collaborator

This lets the copy to the destination device use the host-visible vidmem optimization.

I noticed this while looking at #16767, but the change only helps devices that support host-visible vidmem, which I don't think is what that bug is really about.
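For context, the host-visible vidmem path amounts to writing straight through a persistently mapped pointer into device-local memory instead of going through a staging copy. Below is a minimal C++ sketch of the idea, not the actual ggml code: `vk_buffer_sketch`, `buffer_write_2d`, and `buffer_copy` are simplified stand-ins for the internals of `ggml_vk_buffer_write_2d` and `ggml_vk_buffer_copy` in `ggml-vulkan.cpp`.

```cpp
// Minimal sketch of the idea, not the actual ggml code: a cross-device
// buffer copy is staged through host memory, and the destination-side
// write goes through the same 2D-write helper that already knows how to
// memcpy straight into host-visible device-local (BAR) memory.
#include <cstddef>
#include <cstring>
#include <vector>

// Simplified stand-in for ggml's vk_buffer: real buffers carry a
// VkBuffer/VkDeviceMemory; here we only model the host-visible fast path.
struct vk_buffer_sketch {
    bool  host_visible; // memory type is HOST_VISIBLE | DEVICE_LOCAL
    void* mapped;       // persistently mapped pointer when host_visible
};

// Write `height` rows of `width` bytes (source stride `spitch`) into dst.
// Host-visible vidmem: write directly through the mapped pointer.
static void buffer_write_2d(vk_buffer_sketch & dst, size_t offset,
                            const void * src, size_t spitch,
                            size_t width, size_t height) {
    if (dst.host_visible) {
        for (size_t i = 0; i < height; i++) {
            std::memcpy((char *)dst.mapped + offset + i * width,
                        (const char *)src + i * spitch, width);
        }
        return;
    }
    // Otherwise: copy into a staging buffer and vkCmdCopyBuffer (omitted).
}

// Copy between buffers on different devices, staged through the host.
// Routing the final write through buffer_write_2d is the essence of the
// change: the destination side now hits the host-visible fast path.
static void buffer_copy(vk_buffer_sketch & dst, size_t dst_off,
                        const vk_buffer_sketch & src, size_t src_off,
                        size_t size, std::vector<char> & host_staging) {
    host_staging.resize(size);
    if (src.host_visible) {
        std::memcpy(host_staging.data(), (const char *)src.mapped + src_off, size);
    } // else: GPU readback into host_staging (omitted)
    buffer_write_2d(dst, dst_off, host_staging.data(), size, size, 1);
}

int main() {
    std::vector<char> vidmem(4), payload = {'g','g','m','l'}, staging;
    vk_buffer_sketch dst { true, vidmem.data() };
    vk_buffer_sketch src { true, payload.data() };
    buffer_copy(dst, 0, src, 0, payload.size(), staging);
    return std::memcmp(vidmem.data(), "ggml", 4) != 0; // 0 on success
}
```

The PR is essentially the last line of `buffer_copy` above: routing the destination-side write through the 2D-write helper, so a cross-device copy picks up the same fast path that ordinary host-to-device writes already had.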

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -p 0 -n 128,128,128,128,128 -fa 1 -r 30 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -sm layer
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       197.63 ± 27.82 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        206.10 ± 8.94 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        206.18 ± 8.90 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        206.58 ± 7.41 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        203.64 ± 8.23 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -p 0 -n 128,128,128,128,128 -fa 1 -r 30 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -sm layer
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       207.61 ± 26.07 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        215.62 ± 6.20 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        215.64 ± 5.95 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        215.61 ± 5.82 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        214.82 ± 6.04 |

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner October 27, 2025 02:18
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 27, 2025
@0cc4m 0cc4m merged commit f549b00 into ggml-org:master Oct 29, 2025
72 checks passed
wqerrewetw added a commit to wqerrewetw/llama.cpp that referenced this pull request Oct 29, 2025
* Ci (#11) (#12)

* Fix cl (#7)

* Rename build-amd.yml to build-amd.yml.disabled

* Rename winget.yml to winget.yml.disabled

* Rename server.yml to server.yml.disabled

* Rename build.yml to build.yml.disabled

* Update release.yml

* Rename build-cmake-pkg.yml to build-cmake-pkg.yml.disabled

* Rename build-linux-cross.yml to build-linux-cross.yml.disabled

* Rename build-riscv-native.yml.disabled to build-riscv-native.yml

* Rename docker.yml.disabled to docker.yml

* Rename update-ops-docs.yml to update-ops-docs.yml.disabled

* Remove macOS-arm64 job from release workflow

Removed macOS-arm64 job and its associated steps from the release workflow.

* CUDA: Fix bug in topk-moe for gpt-oss (ggml-org#16821)

* CUDA: Fix bug in topk-moe for gpt-oss

When using ggml_can_fuse_subgraph, the output nodes that were passed in were wrong. This caused `test-backend-ops` to still fuse the nodes (because the nodes are not used elsewhere in the test graph), while fusion did not actually happen in the real gpt-oss graph (see the sketch after this commit message).

* fix for qwen3 too

* change ifndef to ifdef

* vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (ggml-org#16793)

This lets the copy to the destination device use the host-visible
vidmem optimization.

---------

Co-authored-by: Aman Gupta <[email protected]>
Co-authored-by: Jeff Bolz <[email protected]>
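
To make the topk-moe failure mode concrete, here is a hypothetical sketch of a subgraph-fusion check; `can_fuse_subgraph` and the `node` type are illustrative stand-ins and do not reproduce ggml's actual `ggml_can_fuse_subgraph` signature. The point is that the answer depends on which nodes are declared as outputs: in an isolated test graph nothing consumes the subgraph from outside, so even a wrong output list passes, while in the real model graph the true outputs have external consumers and fusion is silently rejected.

```cpp
// Hypothetical sketch of a subgraph-fusion check; illustrative only, this
// is not ggml's actual ggml_can_fuse_subgraph signature. Fusion is legal
// only if no node outside the subgraph reads an interior value: interior
// values vanish after fusion, so only declared outputs may escape.
#include <unordered_set>
#include <vector>

struct node { std::vector<node *> srcs; };

static bool can_fuse_subgraph(const std::vector<node *> & subgraph,
                              const std::vector<node *> & outputs,
                              const std::vector<node *> & whole_graph) {
    std::unordered_set<const node *> inside(subgraph.begin(), subgraph.end());
    std::unordered_set<const node *> out(outputs.begin(), outputs.end());
    for (const node * n : whole_graph) {
        if (inside.count(n)) continue;       // only external consumers matter
        for (const node * s : n->srcs) {
            if (inside.count(s) && !out.count(s)) {
                return false;                // undeclared interior value escapes
            }
        }
    }
    return true;
}

int main() {
    // Toy graph: subgraph {a, b}, b is the true output, consumer reads b.
    node a, b, consumer;
    b.srcs        = { &a };
    consumer.srcs = { &b };
    std::vector<node *> sub   = { &a, &b };
    std::vector<node *> graph = { &a, &b, &consumer };
    bool right = can_fuse_subgraph(sub, { &b }, graph); // true: output declared
    bool wrong = can_fuse_subgraph(sub, { &a }, graph); // false: b escapes undeclared
    return (right && !wrong) ? 0 : 1;
}
```

With the true output list the check passes exactly when fusion is safe; with the wrong list, the test graph and the real graph disagree, which is what the fix addresses by passing the correct output nodes.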