CUDA: add expert reduce kernel #16857

Conversation
```cuda
} else {
#pragma unroll
    for (int i = 0; i < n_expert_used_template; ++i) {
        ggml_cuda_mad(acc, experts[col], weights[i]);
```
I tried loading the weights into shared memory/registers, but it doesn't really make a difference, as the memory slice per row is extremely small (n_expert_used floats per row).
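For readers skimming the thread, here is a minimal sketch of the kind of fused expert-reduce kernel being discussed; the kernel name, memory layout, and launch shape are assumptions for illustration, not the PR's actual implementation:

```cuda
#include <cuda_runtime.h>

// Hypothetical fused expert reduce (not the PR's actual kernel):
//   out[row][col] = sum_i experts[row][i][col] * weights[row][i]
// Assumed contiguous layout: experts is [n_rows][n_expert_used][n_cols],
// weights is [n_rows][n_expert_used]. One block per row.
template <int n_expert_used>
__global__ void moe_expert_reduce_sketch(const float * experts, const float * weights,
                                         float * out, const int n_cols) {
    const int row = blockIdx.x;
    const float * experts_row = experts + (size_t) row * n_expert_used * n_cols;
    const float * weights_row = weights + (size_t) row * n_expert_used;

    for (int col = threadIdx.x; col < n_cols; col += blockDim.x) {
        float acc = 0.0f;
#pragma unroll
        for (int i = 0; i < n_expert_used; ++i) {
            // one multiply-add per used expert (cf. ggml_cuda_mad above)
            acc += experts_row[(size_t) i * n_cols + col] * weights_row[i];
        }
        out[(size_t) row * n_cols + col] = acc;
    }
}
```

Since the weight slice is only n_expert_used floats per row, each weight is cheap to re-read per column iteration, which matches the observation above that staging the weights in shared memory or registers buys nothing.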
Co-authored-by: Johannes Gäßler <[email protected]>
Failures seem unrelated, merging.
Not quite, the new test seems to fail spectacularly on WebGPU. @reeselevine
It looks like the WebGPU build is also failing; it was failing earlier too.
Ok, I see the error here. I'll need to do some investigation to see why it's not using an "inplace" version of the add operation in the newly added tests. I'll investigate as soon as I can; in the meantime, do we want to disable the WebGPU tests to avoid confusion on all new PRs?
From my understanding, it looks like instead of a full overlap in buffers, there is a partial overlap between the two src buffers, so it's not the same as other inplace operations, which have src0 == dst and src1 fully disjoint. That's not something the code currently expects, and it seems like it might be unique to these newly added tests? I'll have to update the logic to do some sort of merging on buffer bindings to handle this, or, as a more temporary fix, see if I can disable support for operations that have this sort of partial overlap.
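In case it helps to pin down the distinction, here is one way the check could be expressed; the struct and function names are hypothetical, not the actual WebGPU backend code:

```cuda
#include <cstdint>

// Hypothetical half-open byte range of a buffer binding.
struct binding_range { uintptr_t begin, end; }; // [begin, end)

static bool overlaps(binding_range a, binding_range b) {
    return a.begin < b.end && b.begin < a.end;
}

// Full aliasing (identical ranges, e.g. src0 == dst) is the easy inplace case;
// a *partial* overlap between the two src ranges is what the new tests produce,
// and would need merged bindings or a fallback path.
static bool is_partial_overlap(binding_range a, binding_range b) {
    const bool identical = a.begin == b.begin && a.end == b.end;
    return overlaps(a, b) && !identical;
}
```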
The CI failures are ok for a while, as long as they don't impact any ongoing WebGPU work.
I'm actually a little confused about the test added here. Specifically, looking at this line: https://github.com/ggml-org/llama.cpp/pull/16857/files#diff-2749fdb8974ec96afa18444a9d546409318b0a862709139b677eee468c479578R4778, it seems like the 5th argument to …

Because, if I disable buffer aliasing validation, as is done in this PR: #16810, the new tests do not pass with that line as currently written, but they do pass if it is changed to …

I will say though that, at least testing locally on the Metal backend, the new tests pass regardless of whether it's …
That test is a reconstruction of llama-graph.cpp, lines 1116–1142 (at commit bea0452).
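Roughly, the pattern being reconstructed is a chain of strided 2D views over the expert outputs that are summed with repeated adds; the following is a paraphrase of that logic, not the exact referenced lines:

```cuda
#include "ggml.h"

// Paraphrase of the llama-graph.cpp reduction: sum the n_expert_used expert
// outputs via strided 2D views and repeated adds.
static ggml_tensor * build_moe_reduce(ggml_context * ctx0, ggml_tensor * experts,
                                      int64_t n_embd, int64_t n_tokens, int n_expert_used) {
    ggml_tensor * moe_out = nullptr;
    for (int i = 0; i < n_expert_used; ++i) {
        // expert i's slice: n_embd x n_tokens, offset by i rows within experts
        ggml_tensor * cur_expert = ggml_view_2d(ctx0, experts, n_embd, n_tokens,
                                                experts->nb[2], i*experts->nb[1]);
        moe_out = i == 0 ? cur_expert : ggml_add(ctx0, moe_out, cur_expert);
    }
    return moe_out;
}
```

Note that these views are non-contiguous, which is the property the WebGPU backend tripped over above.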
Ok, I realized this was due to the non-contiguity introduced by the views in this reduction, so I will disable support for these operations in #16810 and leave a note so that support can be added in the future. I don't think anything else is needed here!
This reverts commit 4146d6a.
This PR adds a kernel that fuses the common MoE mul + (n_expert_used - 1) add operations into one, giving a 1-2% TG (token generation) speed-up depending on n_expert_used.
Tested on a 4090.
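As a usage illustration tying this back to the kernel sketch earlier in the thread, the fused kernel replaces the whole mul/add chain with a single launch; the sizes and pointer names below are made up:

```cuda
// Hypothetical launch: one kernel instead of 1 mul + (n_expert_used - 1) adds.
// d_experts, d_weights, d_out are assumed device pointers; n_expert_used = 4.
const int n_rows = 512;   // tokens
const int n_cols = 4096;  // embedding dim
moe_expert_reduce_sketch<4><<<n_rows, 256>>>(d_experts, d_weights, d_out, n_cols);
cudaDeviceSynchronize();
```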