@jeffbolznv
Collaborator

This is an incremental improvement over #9118 to get work to the GPU a bit sooner. The first part is to start with a smaller number of nodes before the first submit, and ramp it up to the current 100 nodes/submit. The second part is to reduce the dryrun overhead for all the nodes that just need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my old Haswell-era CPU.
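The ramp-up described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `nodes_for_submit` and `plan_submits` are hypothetical, and the intermediate batch sizes (20 and 50) are assumed values chosen only to show the shape of the schedule. The description itself only states that the count starts smaller and ramps up to the current 100 nodes/submit.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical ramp schedule. Only the final value (100 nodes/submit) comes
// from the PR description; 20 and 50 are illustrative assumptions.
static std::size_t nodes_for_submit(std::size_t submit_index) {
    switch (submit_index) {
        case 0:  return 20;   // small first batch so the GPU starts early
        case 1:  return 50;   // intermediate step
        default: return 100;  // steady-state batch size
    }
}

// Returns the size of each batch that would be submitted for a graph with
// `num_nodes` nodes. A batch is flushed when it reaches the current
// threshold or when the last node has been recorded.
std::vector<std::size_t> plan_submits(std::size_t num_nodes) {
    std::vector<std::size_t> batches;
    std::size_t submit_index = 0;
    std::size_t pending = 0;
    for (std::size_t i = 0; i < num_nodes; ++i) {
        ++pending;
        const bool last_node = (i + 1 == num_nodes);
        if (pending >= nodes_for_submit(submit_index) || last_node) {
            batches.push_back(pending);
            pending = 0;
            ++submit_index;
        }
    }
    return batches;
}
```

With this assumed schedule, a 300-node graph would be split into batches of 20, 50, 100, 100 and 30 nodes, so the GPU receives its first work after 20 nodes have been recorded instead of 100.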

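before:

| model                          |       size |     params | backend    |  ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        109.93 ± 0.59 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     | 1000 |         tg128 |        100.70 ± 0.71 |
| starcoder2 7B Q4_0             |   3.76 GiB |     7.17 B | Vulkan     | 1000 |         tg128 |         73.39 ± 0.61 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     | 1000 |         tg128 |         92.45 ± 1.01 |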

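after:

| model                          |       size |     params | backend    |  ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---: | ------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        112.04 ± 0.39 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     | 1000 |         tg128 |        100.59 ± 0.17 |
| starcoder2 7B Q4_0             |   3.76 GiB |     7.17 B | Vulkan     | 1000 |         tg128 |         73.90 ± 0.25 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     | 1000 |         tg128 |         95.13 ± 0.75 |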

I timed how long the dryrun takes and how long it then takes to reach the first submit. These numbers are averaged over 32 evaluations of the model and are all in microseconds. For Llama-3.2-3B-Instruct-Q8_0.gguf:

before:
dryRunTime 122 firstSubmitTime 238 beforeLastSubmitTime 1600 totalTime 10070
after:
dryRunTime 84 firstSubmitTime 130 beforeLastSubmitTime 1639 totalTime 9725

dryRunTime and firstSubmitTime together are the time spent before we submit any work to the GPU (they are disjoint, i.e. firstSubmitTime does not include dryRunTime). beforeLastSubmitTime is roughly the total CPU time, and totalTime is roughly the GPU time. (Note that the GPU time seems to vary from execution to execution; I think this may be related to the KV cache.)

So before this change we had about 0.36ms of GPU idle time (dryRunTime + firstSubmitTime) out of roughly 10ms of total GPU time, and the change reduces it to about 0.21ms, corresponding to around a 1% speedup. The numbers all tend to be a little noisy: the table above shows +3% for this model. Still, there's a clear improvement, and it's generally aligned with my measurements of this idle bubble.
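The idle-bubble arithmetic can be reproduced with a small sketch. It assumes, as stated above, that the idle time before the first submit is simply dryRunTime + firstSubmitTime, and it makes the simplifying assumption that any idle time saved shortens the evaluation by the same amount. The `Timings` struct and the function names are hypothetical and exist only for this illustration.

```cpp
// Timings in microseconds, taken from the measurements quoted above.
struct Timings {
    double dry_run_us;
    double first_submit_us;
    double total_us;
};

// GPU idle time before the first submit: the two phases are disjoint,
// so they are summed.
double idle_before_first_submit_us(const Timings& t) {
    return t.dry_run_us + t.first_submit_us;
}

// Idle time saved, as a fraction of the "before" total time.
// Assumption: saved idle time translates one-to-one into saved wall time.
double idle_saved_fraction(const Timings& before, const Timings& after) {
    return (idle_before_first_submit_us(before) -
            idle_before_first_submit_us(after)) / before.total_us;
}

// Measurements for Llama-3.2-3B-Instruct-Q8_0.gguf from this PR.
const Timings kBefore{122.0, 238.0, 10070.0};
const Timings kAfter{84.0, 130.0, 9725.0};
```

For the quoted measurements this gives 360 µs of idle time before and 214 µs after, so the saved 146 µs is on the order of 1% of the roughly 10 ms total, which is consistent with the estimate above.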

@jeffbolznv jeffbolznv requested a review from 0cc4m November 25, 2024 16:54
@jeffbolznv
Collaborator Author

CC @mtavenrath

@jeffbolznv jeffbolznv added the Vulkan Issues specific to the Vulkan backend label Nov 28, 2024
Collaborator

@0cc4m 0cc4m left a comment


I reproduced the slight performance improvement. Looks good.

@0cc4m 0cc4m merged commit f095a64 into ggml-org:master Nov 29, 2024
54 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024