@am17an am17an commented Oct 30, 2025

This PR adds a kernel that fuses the common MoE mul + (n_expert_used - 1) add operations into one, giving a 1-2% TG speed-up depending on n_expert_used.

Tested on a 4090

| Model | Test | t/s master | t/s expert-reduce | Speedup |
| --- | --- | ---: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | tg32 | 198.21 | 200.64 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg64 | 195.64 | 197.59 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg128 | 193.52 | 194.77 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg32 | 167.77 | 171.98 | 1.03 |
| qwen3moe 30B.A3B Q4_0 | tg64 | 161.53 | 165.00 | 1.02 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 159.33 | 162.50 | 1.02 |
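
For context, here is a minimal sketch of the kind of fused reduction such a kernel performs; the kernel name, argument layout, and launch configuration below are illustrative assumptions, not the actual kernel added by this PR:

```cuda
// Sketch only: fuse "weights[i] * experts[i]" and the (n_expert_used - 1) adds
// into a single pass over one output row. The real kernel also handles
// multiple tokens/rows and non-contiguous strides.
__global__ void moe_expert_reduce_sketch(
        const float * experts,  // [n_expert_used, n_cols] expert outputs for one token
        const float * weights,  // [n_expert_used]         routing weights
        float       * dst,      // [n_cols]                reduced output
        const int     n_expert_used,
        const int     n_cols) {
    const int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (col >= n_cols) {
        return;
    }

    float acc = 0.0f;
    for (int i = 0; i < n_expert_used; ++i) {
        acc += weights[i] * experts[(size_t) i*n_cols + col];
    }
    dst[col] = acc;
}
```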

@am17an am17an requested a review from slaren as a code owner October 30, 2025 08:10
@am17an (author) commented on this diff hunk:

```cuda
} else {
#pragma unroll
    for (int i = 0; i < n_expert_used_template; ++i) {
        ggml_cuda_mad(acc, experts[col], weights[i]);
```

I tried loading weights into shared memory/registers, but it doesn't really make a difference as the memory slice per row is extremely small (n_expert_used floats per row)
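
A rough reconstruction of the shared-memory variant that was tried (my sketch under that assumption, not the PR's code); with only n_expert_used floats of weights per row, staging them barely changes memory traffic, consistent with the comment above:

```cuda
// Sketch: stage the per-row routing weights in shared memory before reducing.
__global__ void moe_reduce_smem_sketch(const float * experts, const float * weights,
                                       float * dst, const int n_expert_used, const int n_cols) {
    extern __shared__ float s_weights[]; // n_expert_used floats, sized at launch

    for (int i = threadIdx.x; i < n_expert_used; i += blockDim.x) {
        s_weights[i] = weights[i];
    }
    __syncthreads();

    const int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (col < n_cols) {
        float acc = 0.0f;
        for (int i = 0; i < n_expert_used; ++i) {
            acc += s_weights[i] * experts[(size_t) i*n_cols + col];
        }
        dst[col] = acc;
    }
}
```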

@github-actions github-actions bot added the testing, Nvidia GPU, and ggml labels Oct 30, 2025
Co-authored-by: Johannes Gäßler <[email protected]>
am17an commented Oct 31, 2025

Failures seem unrelated, merging

@am17an am17an merged commit 4146d6a into ggml-org:master Oct 31, 2025
66 of 69 checks passed
@am17an am17an deleted the expert-reduce branch October 31, 2025 12:05
CISC commented Oct 31, 2025

> Failures seem unrelated, merging

Not quite, the new test seems to fail spectacularly on webgpu. @reeselevine
https://github.com/ggml-org/llama.cpp/actions/runs/18971988336/job/54181756009#step:7:40822

am17an commented Oct 31, 2025

It looks like the webGPU build is also failing; it was failing earlier too.

@reeselevine
Ok, I see the error here. I'll need to investigate why it's not using an "inplace" version of the add operation in the newly added tests. I'll look into it as soon as I can; in the meantime, do we want to disable the webgpu tests to avoid confusion on all new PRs?

reeselevine commented Oct 31, 2025

From my understanding, it looks like instead of a full overlap in buffers, there is a partial overlap in the two src buffers, so it's not the same as other inplace operations, which have src0 = dst and src1 fully disjoint.

That's not something the code currently expects, and it seems like it might be unique to these newly added tests?

I'll have to update the logic to do some sort of merging on buffer bindings to handle this, or, as a more temporary fix, see if I can disable support for operations that have this sort of partial overlap.

CISC commented Oct 31, 2025

The CI failures are ok for a while, as long as they don't impact any ongoing webgpu work.

@reeselevine
I'm actually a little confused about the test added here. Specifically, looking at this line: https://github.com/ggml-org/llama.cpp/pull/16857/files#diff-2749fdb8974ec96afa18444a9d546409318b0a862709139b677eee468c479578R4778, it seems like the 5th argument to ggml_view_2d should be weighted->nb[1], not weighted->nb[2]. This is based on the definition of ggml_view_2d.
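
For reference, ggml_view_2d is declared in ggml.h roughly as follows; the 5th positional argument is nb1, the row stride in bytes:

```cpp
GGML_API struct ggml_tensor * ggml_view_2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int64_t               ne0,
        int64_t               ne1,
        size_t                nb1, // row stride in bytes
        size_t                offset);
```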

If I disable buffer aliasing validation, as is done in #16810, the new tests do not pass with that line as currently written, but they do pass if it is changed to nb[1]. However, tests for other backends are currently passing, so perhaps I'm misunderstanding something; I don't have much context on the mixture-of-experts algorithm in general, so that's very possible.

I will say though that, at least testing locally on the Metal backend, the new tests pass regardless of whether it's nb[2] or nb[1], which might be because the other backends are using fused addition kernels?

am17an commented Nov 1, 2025

That test is a reconstruction of this code in llama-graph.cpp (src/llama-graph.cpp, lines 1116 to 1142 at bea0452):

```cpp
    // order the views before the adds
    for (uint32_t i = 0; i < hparams.n_expert_used; ++i) {
        cur_experts[i] = ggml_view_2d(ctx0, experts, n_embd, n_tokens, experts->nb[2], i*experts->nb[1]);

        ggml_build_forward_expand(gf, cur_experts[i]);
    }

    // aggregate experts
    // note: here we explicitly use hparams.n_expert_used instead of n_expert_used
    //       to avoid potentially a large number of add nodes during warmup
    //       ref: https://github.com/ggml-org/llama.cpp/pull/14753
    ggml_tensor * moe_out = cur_experts[0];

    for (uint32_t i = 1; i < hparams.n_expert_used; ++i) {
        moe_out = ggml_add(ctx0, moe_out, cur_experts[i]);
    }

    if (hparams.n_expert_used == 1) {
        // avoid returning a non-contiguous tensor
        moe_out = ggml_cont(ctx0, moe_out);
    }

    cb(moe_out, "ffn_moe_out", il);

    return moe_out;
}
```
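
One reading of the strides in that view (inferred from the quoted code, not stated in the thread): experts appears to be laid out as ne = [n_embd, n_expert_used, n_tokens], so a view of expert i across all tokens uses the token stride nb[2] as the view's row stride and i*nb[1] as the byte offset, which is also why the resulting view is non-contiguous. A small sketch of that index math:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: byte offset of element (e, i, t) in a tensor with byte strides nb[],
// assuming experts has ne = [n_embd, n_expert_used, n_tokens] (my assumption).
static size_t experts_offset(const size_t nb[3], int64_t e, int64_t i, int64_t t) {
    return (size_t) e*nb[0] + (size_t) i*nb[1] + (size_t) t*nb[2];
}

// The 2D view of expert i over all tokens then starts at experts_offset(nb, 0, i, 0)
// and steps by nb[2] between consecutive rows (tokens), matching
// ggml_view_2d(ctx0, experts, n_embd, n_tokens, experts->nb[2], i*experts->nb[1]).
```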

reeselevine commented Nov 2, 2025

Ok, I realized this was due to the non-contiguity introduced in the views for this reduction, so I will disable support for these operations in #16810 and leave a note so that it can be added in the future. I don't think anything else is needed here!
