@am17an am17an commented Oct 30, 2025

This PR adds a kernel that fuses the common MoE mul + (n_expert_used - 1) add operations into one, giving a 1-2% TG speed-up depending on n_expert_used.

Tested on a 4090

| Model | Test | t/s master | t/s expert-reduce | Speedup |
| --- | --- | ---: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | tg32 | 198.21 | 200.64 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg64 | 195.64 | 197.59 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg128 | 193.52 | 194.77 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg32 | 167.77 | 171.98 | 1.03 |
| qwen3moe 30B.A3B Q4_0 | tg64 | 161.53 | 165.00 | 1.02 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 159.33 | 162.50 | 1.02 |
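
For context, here is a minimal sketch of the kind of fused reduction such a kernel performs; the kernel name, argument layout, and launch configuration below are illustrative assumptions, not the actual kernel added by this PR:

```cuda
// Sketch only: fuse "weights[i] * experts[i]" and the (n_expert_used - 1) adds
// into a single pass over one output row. The real kernel also handles
// multiple tokens/rows and non-contiguous strides.
__global__ void moe_expert_reduce_sketch(
        const float * experts,  // [n_expert_used, n_cols] expert outputs for one token
        const float * weights,  // [n_expert_used]         routing weights
        float       * dst,      // [n_cols]                reduced output
        const int     n_expert_used,
        const int     n_cols) {
    const int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (col >= n_cols) {
        return;
    }

    float acc = 0.0f;
    for (int i = 0; i < n_expert_used; ++i) {
        acc += weights[i] * experts[(size_t) i*n_cols + col];
    }
    dst[col] = acc;
}
```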

@am17an am17an requested a review from slaren as a code owner October 30, 2025 08:10
@am17an (author) commented on this diff hunk:

```cuda
} else {
#pragma unroll
    for (int i = 0; i < n_expert_used_template; ++i) {
        ggml_cuda_mad(acc, experts[col], weights[i]);
```

I tried loading weights into shared memory/registers, but it doesn't really make a difference as the memory slice per row is extremely small (n_expert_used floats per row)
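
A rough reconstruction of the shared-memory variant that was tried (my sketch under that assumption, not the PR's code); with only n_expert_used floats of weights per row, staging them barely changes memory traffic, consistent with the comment above:

```cuda
// Sketch: stage the per-row routing weights in shared memory before reducing.
__global__ void moe_reduce_smem_sketch(const float * experts, const float * weights,
                                       float * dst, const int n_expert_used, const int n_cols) {
    extern __shared__ float s_weights[]; // n_expert_used floats, sized at launch

    for (int i = threadIdx.x; i < n_expert_used; i += blockDim.x) {
        s_weights[i] = weights[i];
    }
    __syncthreads();

    const int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (col < n_cols) {
        float acc = 0.0f;
        for (int i = 0; i < n_expert_used; ++i) {
            acc += s_weights[i] * experts[(size_t) i*n_cols + col];
        }
        dst[col] = acc;
    }
}
```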

@github-actions github-actions bot added the testing, Nvidia GPU, and ggml labels Oct 30, 2025
Co-authored-by: Johannes Gäßler <[email protected]>
am17an commented Oct 31, 2025

Failures seem unrelated, merging

@am17an am17an merged commit 4146d6a into ggml-org:master Oct 31, 2025
66 of 69 checks passed
@am17an am17an deleted the expert-reduce branch October 31, 2025 12:05
CISC commented Oct 31, 2025

> Failures seem unrelated, merging

Not quite, the new test seems to fail spectacularly on webgpu. @reeselevine
https://github.com/ggml-org/llama.cpp/actions/runs/18971988336/job/54181756009#step:7:40822

am17an commented Oct 31, 2025

It looks like the webGPU build is also failing; it was failing earlier too.

@reeselevine
Ok, I see the error here. I'll need to investigate why it's not using an "inplace" version of the add operation in the newly added tests. I'll look into it as soon as I can; in the meantime, do we want to disable the webgpu tests to avoid confusion on all new PRs?

reeselevine commented Oct 31, 2025

From my understanding, it looks like instead of a full overlap in buffers, there is a partial overlap in the two src buffers, so it's not the same as other inplace operations, which have src0 = dst and src1 fully disjoint.

That's not something the code currently expects, and it seems like it might be unique to these newly added tests?

I'll have to update the logic to do some sort of merging on buffer bindings to handle this, or, as a more temporary fix, see if I can disable support for operations that have this sort of partial overlap.

CISC commented Oct 31, 2025

The CI failures are ok for a while, as long as they don't impact any ongoing webgpu work.

@reeselevine
I'm actually a little confused about the test added here. Specifically, looking at this line: https://github.com/ggml-org/llama.cpp/pull/16857/files#diff-2749fdb8974ec96afa18444a9d546409318b0a862709139b677eee468c479578R4778, it seems like the 5th argument to ggml_view_2d should be weighted->nb[1], not weighted->nb[2]. This is based on the definition of ggml_view_2d.
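
For reference, ggml_view_2d is declared in ggml.h roughly as follows; the 5th positional argument is nb1, the row stride in bytes:

```cpp
GGML_API struct ggml_tensor * ggml_view_2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int64_t               ne0,
        int64_t               ne1,
        size_t                nb1, // row stride in bytes
        size_t                offset);
```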

If I disable buffer aliasing validation, as is done in #16810, the new tests do not pass with that line as currently written, but they do pass if it is changed to nb[1]. However, tests for other backends are currently passing, so perhaps I'm misunderstanding something; I don't have much context on the mixture-of-experts algorithm in general, so that's very possible.

I will say though that, at least testing locally on the Metal backend, the new tests pass regardless of whether it's nb[2] or nb[1], which might be because the other backends are using fused addition kernels?

am17an commented Nov 1, 2025

That test is a reconstruction of this code in llama-graph.cpp (src/llama-graph.cpp, lines 1116 to 1142 at bea0452):

```cpp
    // order the views before the adds
    for (uint32_t i = 0; i < hparams.n_expert_used; ++i) {
        cur_experts[i] = ggml_view_2d(ctx0, experts, n_embd, n_tokens, experts->nb[2], i*experts->nb[1]);

        ggml_build_forward_expand(gf, cur_experts[i]);
    }

    // aggregate experts
    // note: here we explicitly use hparams.n_expert_used instead of n_expert_used
    //       to avoid potentially a large number of add nodes during warmup
    //       ref: https://github.com/ggml-org/llama.cpp/pull/14753
    ggml_tensor * moe_out = cur_experts[0];

    for (uint32_t i = 1; i < hparams.n_expert_used; ++i) {
        moe_out = ggml_add(ctx0, moe_out, cur_experts[i]);
    }

    if (hparams.n_expert_used == 1) {
        // avoid returning a non-contiguous tensor
        moe_out = ggml_cont(ctx0, moe_out);
    }

    cb(moe_out, "ffn_moe_out", il);

    return moe_out;
}
```
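
One reading of the strides in that view (inferred from the quoted code, not stated in the thread): experts appears to be laid out as ne = [n_embd, n_expert_used, n_tokens], so a view of expert i across all tokens uses the token stride nb[2] as the view's row stride and i*nb[1] as the byte offset, which is also why the resulting view is non-contiguous. A small sketch of that index math:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: byte offset of element (e, i, t) in a tensor with byte strides nb[],
// assuming experts has ne = [n_embd, n_expert_used, n_tokens] (my assumption).
static size_t experts_offset(const size_t nb[3], int64_t e, int64_t i, int64_t t) {
    return (size_t) e*nb[0] + (size_t) i*nb[1] + (size_t) t*nb[2];
}

// The 2D view of expert i over all tokens then starts at experts_offset(nb, 0, i, 0)
// and steps by nb[2] between consecutive rows (tokens), matching
// ggml_view_2d(ctx0, experts, n_embd, n_tokens, experts->nb[2], i*experts->nb[1]).
```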

reeselevine commented Nov 2, 2025

Ok, I realized this was due to the non-contiguity introduced in the views for this reduction, so I will disable support for these operations in #16810 and leave a note so that it can be added in the future. I don't think anything else is needed here!
