Implement and use cuda graph plans #16548
base: master
Conversation
One thing to mention here is that on HIP the generation of the graph is relatively more expensive than on CUDA, which makes the removal of disable_due_to_too_many_updates have a perhaps outsized negative effect there. I will try to find the time to investigate.
@AesSedai Could you give gpt-oss-120b a try? I found that for some reason this PR applied to latest master is much slower for GLM 4.5 Air. Will need to investigate further. My testing shows gpt-oss-120b still has the same speedup. @IMbackK My wild guess is that if the graph gets reused a few times it should pay off the graph-building overhead. Without this PR it is understandable to have this
I'll have to download it and set it up, but I should be able to run it later tonight.
Looks like I screwed up the previous commits, as the graph update logic is completely broken. I can't even reproduce my previous results :( Anyway, the issues should now be fixed. Refactored a bit and added back the disable_due_to_too_many_updates logic. @AesSedai @DocShotgun Thank you for your tests! Please check if this commit (3afbd9f) works. Performance comparison
Interesting. Seems like it's potentially beneficial in certain cases but not in others (and potentially a regression in some cases). Sadly not the free lunch I was hoping for lol.
I just rebuilt fresh and did a 3-way comparison between mainline lcpp master, this PR, and ik_llama.cpp@main (including ikawrakow/ik_llama.cpp#829, which helps a lot with -ooae over there). tl;dr: this PR is possibly giving a bit over +2% TG over master with basically the same PP in quick llama-sweep-bench testing on my hybrid GPU+CPU gaming rig.
👈 Details

model=/mnt/ai/models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-fa 1 \
-c 20480 \
-ub 4096 -b 4096 \
-ngl 99 \
-ot "blk\.(0|1|2|3|4|5|6|7|8)\.ffn.*=CUDA0" \
-ot "ffn_(up|gate|down)_exps\.weight=CPU" \
--threads 16 \
--no-mmap
# for ik_llama.cpp add
-fmoe \
-ooae \
--warmup-batch

ik_llama.cpp/main@dbfd1515 -fmoe -ooae --warmup-batch
llama.cpp/master@683fa6ba
llama.cpp/wishstudio:cuda_graph_plan PR16548 3afbd9f rebased to 683fa6b + ug/port-sweep-bench
Super interesting PR! Off-topic to this PR itself, but relevant to the discussion: what is llama-sweep-bench?
@DocShotgun I guess it's expected that this optimization is model- and machine-specific 😄 @DocShotgun @AesSedai @ubergarm Looks like all of you are running this on Linux and I'm running Win11. PP performance is very unstable in my tests but looks much more stable in yours. During my tests I also see lots of random latencies and slowdowns during kernel launches. This could explain why this PR removes much more overhead on my system than on yours. But anyway, even if this PR does not help much on your systems, I didn't expect it to cause any slowdown. Theoretically the only thing this PR changes on these MoE models is making a consolidated CUDA graph. @aendk I spent like 10 minutes searching for it and ended up with an example in ik_llama.
I can confirm that kernel launch overheads are significantly higher and more variable on Win11 compared to Linux for NVIDIA GPUs, due to their batched dispatch via WDDM2.
I've heard the same thing about kernel launch overhead being significantly higher on Windows than Linux. I notice you mention in the OP that you found building a graph and executing it only once to be always faster than calling the kernels individually - perhaps the lower kernel launch overhead on Linux makes this not necessarily true there?
It's a long story, but I believe some folks use it to sample PP/TG speeds across a given kv-cache depth and use vibe-coded Python scripts to plot/visualize the results across multiple runs. I'd be happy to see it added officially but haven't fixed up the arguments enough to open a proper PR. Go ask over on the https://huggingface.co/BeaverAI discord if you want more details or example usage.
Another reason to switch to Linux 🥲
Could be. But in the last commit I introduced back the disable_due_to_too_many_updates logic.
I tested this on Windows, on GLM 4.6, on a 5090 + 9950X3D. I saw an average loss of about 8% compared to mainline, taking me from 5.2 to 4.7 tok/s using this quant.
While I welcome the proposal, I feel there are issues that still need to be addressed, namely:
- stricter equivalence checking when updating graph plans, and
- fixing the logic in the CUDA backend
To shed some light on the perf gains, here are some profiles (with a locally hacked fix of point 2 raised above), where I offloaded a single FFN with:
sudo nsys profile --output=gp_offload --trace=cuda,nvtx,osrt --sample=cpu --gpu-metrics-devices=all --cuda-graph-trace=node ./build-x64-linux-gcc-reldbg/bin/llama-cli -m models/gemma-3n-E2B-it-Q8_0.gguf -fa 1 -p "Hello World" --top-k 1 -ot "blk\.1.ffn*=CPU" --no-conversation -n 10
One can see that CUDA graphs nicely speed up the GPU portion of the workload (CUDA kernels run 100% of the time, as we eliminate kernel-launch overheads). However, the CPU workload does not benefit at all. Given that we are on heterogeneous systems, optimizing the GPU workload may actually induce thermal throttling on the CPU, as the CPU will be active more often in a given timeframe. Thus, slowdowns can occur in E2E tests.
More performant approaches for models that are too large to fit in VRAM would be ones where weights are streamed asynchronously onto the GPU (e.g. while the previous workload is running either on the GPU or the CPU), though such approaches are also more complex.
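A minimal sketch of that streaming idea, for illustration only (this is not part of the PR; the function names and buffers are made up): a dedicated copy stream uploads the next layer's weights from pinned host memory while the compute stream is still busy, and the compute stream only waits on an event once it actually needs those weights.

#include <cuda_runtime.h>

// Kick off the upload of the next layer's weights on a separate copy stream.
// The host buffer must be page-locked (cudaHostAlloc/cudaHostRegister) for the
// copy to be truly asynchronous.
static void prefetch_next_layer(const void * host_weights, void * device_buffer, size_t nbytes,
                                cudaStream_t copy_stream, cudaEvent_t upload_done) {
    cudaMemcpyAsync(device_buffer, host_weights, nbytes, cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(upload_done, copy_stream);
}

// Before launching kernels that consume the prefetched weights, make the compute
// stream wait for the upload without blocking the host thread.
static void wait_for_prefetch(cudaStream_t compute_stream, cudaEvent_t upload_done) {
    cudaStreamWaitEvent(compute_stream, upload_done, 0);
}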
// check if we are doing a graph update
if (cuda_graph->instance == nullptr && use_cuda_graph                         // no graph -> graph
    || cuda_graph->instance != nullptr && !use_cuda_graph                     // graph -> no graph
    || use_cuda_graph && is_cuda_graph_update_required(cuda_graph, cgraph)) { // graph property mismatch
    cuda_graph->number_consecutive_updates++;
    if (cuda_graph->number_consecutive_updates >= 4) {
        cuda_ctx->disable_graph_due_to_too_many_updates = true;
        use_cuda_graph = false;
    } else {
        cuda_graph_update_required = true;
    }
} else {
    cuda_graph->number_consecutive_updates = 0;
}
The logic here is faulty. Before, the update check on the CUDA graph was performed for every decoded token, whereas now it's done only when sched->plan_dirty = true (which is set in ggml_backend_sched_split_graph, which is invoked only in ggml_backend_sched_alloc_graph/ggml_backend_sched_reserve). As a consequence, we no longer call the function when the CUDA graph does not need to be updated, leading to number_consecutive_updates being increased and CUDA graphs being disabled eventually.
Good catch!
I added a new number_consecutive_computes tracker, maintained in ggml_backend_cuda_graph_plan_compute, which clears the number_consecutive_updates tracker when a graph is used more than once.
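For illustration, a rough sketch of that bookkeeping (not the PR's exact code; the struct and helper names here are made up, only the two counters and the disable flag mirror what is described above):

struct cuda_graph_counters {
    int  number_consecutive_updates  = 0;  // back-to-back graph (re)builds/updates
    int  number_consecutive_computes = 0;  // launches of the same instantiated graph
    bool disabled_due_to_too_many_updates = false;
};

// Called once per plan execution from the graph-plan compute path.
static void track_plan_compute(cuda_graph_counters & c, bool update_was_required) {
    if (update_was_required) {
        c.number_consecutive_updates++;
        c.number_consecutive_computes = 0;
        if (c.number_consecutive_updates >= 4) {
            c.disabled_due_to_too_many_updates = true;  // fall back to regular kernel launches
        }
    } else {
        c.number_consecutive_computes++;
        if (c.number_consecutive_computes > 1) {
            // the graph is actually being reused, so the update cost is paying off
            c.number_consecutive_updates = 0;
        }
    }
}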
ggml/src/ggml-backend.cpp
struct ggml_backend_sched_plan * plans;
int n_plans;
int plans_capacity;
bool plan_dirty;
Suggested change:
- bool plan_dirty;
+ bool plan_needs_update;
Nit: not sure if dirty is the appropriate wording here.
Changed.
if (sched->splits[i].backend_id != sched->plans[i].backend_id) {
    create_new_plans = true;
    break;
}
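In context, the check above roughly amounts to the following (a hedged reconstruction, not the PR's exact code; n_splits, splits and backend_id are existing scheduler fields, n_plans/plans come from the diff above, and the helper name is made up): reuse the existing plans only if the split count and every split's backend match the previous run.

static bool sched_can_reuse_plans(const struct ggml_backend_sched * sched) {
    if (sched->n_splits != sched->n_plans) {
        return false;                    // split count changed -> rebuild all plans
    }
    for (int i = 0; i < sched->n_splits; i++) {
        if (sched->splits[i].backend_id != sched->plans[i].backend_id) {
            return false;                // a split moved to another backend -> rebuild
        }
    }
    return true;                         // otherwise let each backend attempt a cheap plan update
}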
imo, we should actually check for the full equivalence of the generated splits (i.e. the ggml_cgraph objects contained in the splits & plans instead of just their backends)
The current check is a bare minimum for ensuring correctness. It basically says "if the number of splits or the backend of any split changes, we rebuild everything". If only the interior of a split changes, handling the change is delegated to the backend's graph update routine (the code is just below this check). In the current CUDA backend, if the graph change is too large, cudaGraphExecUpdate will fail and cudaGraphInstantiate will be called to reinstantiate a new executable graph. I think this will not generate too much overhead, as cudaGraphInstantiate is inevitable in that case and only the cudaGraphExecUpdate call can be saved.
In order to improve this, I guess something like the enhanced API in #14514 needs to be implemented. It passes more information to help the backend make an informed decision on recreate vs. update. Anyway, I believe most of the scaffolding is missing, so this would be out of scope for this PR.
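For reference, a minimal sketch of that update-or-reinstantiate fallback, assuming the CUDA 12 runtime API (illustrative only, not the backend's actual code; the helper name is made up):

#include <cuda_runtime.h>

// Try to patch an existing executable graph in place; if the topology changed
// too much, drop it and instantiate a fresh one.
static void update_or_reinstantiate(cudaGraphExec_t & exec, cudaGraph_t captured) {
    if (exec != nullptr) {
        cudaGraphExecUpdateResultInfo info;
        if (cudaGraphExecUpdate(exec, captured, &info) == cudaSuccess) {
            return;                      // cheap path: in-place update succeeded
        }
        cudaGetLastError();              // clear the error left by the failed update
        cudaGraphExecDestroy(exec);
        exec = nullptr;
    }
    cudaGraphInstantiate(&exec, captured, 0);   // expensive path: full reinstantiation
}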
By the way, if the split graph is indeed identical, the CUDA backend will find out via the is_cuda_graph_update_required fast path. I don't think it's a good idea to move the checks up to the backend scheduler, as different backends could have different definitions of equivalence depending on various factors, as discussed in #14514. We may calculate these factors in the scheduler, but the ultimate decision is still better made in the backend.
Added the







This PR tries to implement graph plan APIs for the CUDA backend, as well as code in ggml-backend.cpp to actually use the graph plan APIs when a backend supports them.

The main functional improvement is to support CUDA graphs when the graph is split (e.g. for hybrid inference). Currently the graph update and reuse logic (ggml_backend_sched_update_plans) is a simple heuristic: only try updating previous graphs when the number of splits and their corresponding backends are the same as in the previous run. As the benchmarks show, this universally accelerates hybrid inference TG performance by up to 30%.

The CUDA graph execution code is refactored and cleaned up. Two out of three original graph plan fail paths are removed: disable_due_to_failed_graph_capture and disable_due_to_too_many_updates. The former because I found no code setting it to true; the latter because I currently have no idea about its semantics in a split graph scenario, but removing it seems to not degrade performance at all. Interestingly, I found that on my rig, even repeatedly building a graph and then executing it only once is always faster than calling the kernels individually. I suspect this is the reason performance increased in tests even for CUDA-only workloads, although this PR's optimizations do not target them. This of course needs to be verified on more hardware configurations.

Performance comparison:
RTX 5090 + 13700k, 128GB 6400 MT/s RAM