
Conversation

am17an (Collaborator) commented Oct 22, 2025

This is a follow-up to #16630. This PR adds the ability to fuse the following common GEMV operations:

  • GLU
  • Bias + GLU
  • Bias

It uses a template bool to determine whether we are on the fusion path, then does runtime checks to decide which fusion path to take. This PR also splits up mmvq (by type) and mmvf (by ncols-dst), as their compile times were becoming large after this change. The change improves TG (which is I/O bound) for almost all classes of models. Apart from adding tests to test-backend-ops, I also spot-checked perplexity on a couple of models and it is unchanged.
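To make the mechanism concrete, here is a minimal sketch of that dispatch pattern (illustrative only, not the actual llama.cpp kernels; all names are made up): a compile-time template bool selects the fusion path, and runtime pointer checks pick the concrete epilogue.

```cpp
#include <cuda_runtime.h>

// Sketch of a GEMV with a compile-time fusion switch. The gate values are
// precomputed here for simplicity; the point is only the dispatch structure.
template <bool has_fusion>
__global__ void gemv_fused(const float * __restrict__ W,    // nrows x ncols weight matrix
                           const float * __restrict__ x,    // ncols input vector
                           const float * __restrict__ bias, // nrows bias or nullptr
                           const float * __restrict__ gate, // nrows gate values or nullptr
                           float       * __restrict__ y,    // nrows output
                           const int nrows, const int ncols) {
    const int row = blockIdx.x*blockDim.x + threadIdx.x;
    if (row >= nrows) {
        return;
    }

    float sum = 0.0f;
    for (int col = 0; col < ncols; ++col) {
        sum += W[row*ncols + col] * x[col];
    }

    if (has_fusion) {
        // Runtime checks select which fused epilogue is applied.
        if (bias != nullptr) {
            sum += bias[row];
        }
        if (gate != nullptr) {
            const float g = gate[row];
            sum *= g / (1.0f + expf(-g)); // e.g. a SwiGLU-style gate
        }
    }

    y[row] = sum;
}

// Host-side dispatch: the template bool is fixed at compile time, so the
// unfused instantiation carries no overhead from the fusion branches.
static void launch_gemv(const float * W, const float * x, const float * bias, const float * gate,
                        float * y, int nrows, int ncols, cudaStream_t stream) {
    const dim3 block(256);
    const dim3 grid((nrows + block.x - 1)/block.x);
    if (bias != nullptr || gate != nullptr) {
        gemv_fused<true ><<<grid, block, 0, stream>>>(W, x, bias, gate, y, nrows, ncols);
    } else {
        gemv_fused<false><<<grid, block, 0, stream>>>(W, x, bias, gate, y, nrows, ncols);
    }
}
```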

Tested on 6x 4090

| Model | Test | t/s master | t/s cuda_fuse_gate | Speedup |
|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | tg32 | 118.72 | 125.14 | 1.05 |
| gpt-oss 120B MXFP4 MoE | tg64 | 116.91 | 123.09 | 1.05 |
| gpt-oss 120B MXFP4 MoE | tg128 | 115.72 | 121.74 | 1.05 |
| gpt-oss 20B MXFP4 MoE | tg32 | 171.60 | 180.07 | 1.05 |
| gpt-oss 20B MXFP4 MoE | tg64 | 169.46 | 177.63 | 1.05 |
| gpt-oss 20B MXFP4 MoE | tg128 | 167.58 | 175.59 | 1.05 |
| qwen3moe 30B.A3B Q4_0 | tg32 | 154.72 | 162.06 | 1.05 |
| qwen3moe 30B.A3B Q4_0 | tg64 | 151.37 | 158.40 | 1.05 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 149.25 | 156.00 | 1.05 |
| qwen3 0.6B F16 | tg32 | 310.61 | 333.92 | 1.08 |
| qwen3 0.6B F16 | tg64 | 306.26 | 325.99 | 1.06 |
| qwen3 0.6B F16 | tg128 | 303.14 | 322.62 | 1.06 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | tg32 | 68.99 | 72.30 | 1.05 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | tg64 | 68.24 | 71.44 | 1.05 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | tg128 | 67.53 | 70.71 | 1.05 |
| llama 8B Q4_0 | tg32 | 133.00 | 137.42 | 1.03 |
| llama 8B Q4_0 | tg64 | 131.89 | 136.47 | 1.03 |
| llama 8B Q4_0 | tg128 | 130.78 | 135.35 | 1.03 |
| gemma 7B Q4_0 | tg32 | 123.23 | 126.88 | 1.03 |
| gemma 7B Q4_0 | tg64 | 122.28 | 125.76 | 1.03 |
| gemma 7B Q4_0 | tg128 | 121.45 | 124.74 | 1.03 |

github-actions bot added labels on Oct 22, 2025: testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), python (python script changes), ggml (changes relating to the ggml tensor library for machine learning)
am17an force-pushed the cuda_fuse_gate_bias branch from c0a69df to 22ee634 on October 22, 2025 08:04
am17an (Collaborator, Author) commented Oct 22, 2025

@ggerganov after #16649 and this PR, tg for gpt-oss models should increase by ~9-10%

ORippler (Contributor) commented:

Curious, but how much does this increase the binary size for the CUDA backend?

am17an (Collaborator, Author) commented Oct 22, 2025

> Curious, but how much does this increase the binary size for the CUDA backend?

It increases by about 20% (from 30M to 36M on my machine).

JohannesGaessler (Collaborator) left a review comment:

I'll do performance testing on either Friday or Saturday when (hopefully) I'll finally be able to get the RTX 5090 that NVIDIA sent me to work.

JohannesGaessler (Collaborator) commented:
Regarding binary size: when I compile the CUDA backend with GGML_NATIVE=OFF, the size of libggml-cuda.so increases from 106 MiB to 145 MiB. This seems disproportionate to the number of added template instances. Did you check for register spilling as ncols increases? That would result in disproportionate compilation times and binary sizes, and the performance would be bad anyway.

In any case, for MMVF we can shave off a bit of binary size independently of this PR by only compiling it for cases not covered by MMF.

am17an (Collaborator, Author) commented Oct 23, 2025

Since the main use case is ncols=1, I am also okay with only doing fusion for that case.

JohannesGaessler (Collaborator) commented:

I think that would also be fine. Matrix multiplications with small batch sizes > 1 are relevant for batched-inference throughput and speculative decoding, but we can always revisit those cases later.

am17an force-pushed the cuda_fuse_gate_bias branch 4 times, most recently from a6e0d34 to 9b95697 on October 23, 2025 16:41
am17an (Collaborator, Author) commented Oct 23, 2025

Simplified the code to only fuse for ncols_dst = 1; binary size and compilation time should now be mostly unaffected by this change.
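For illustration only (hypothetical names, not the actual code), the narrowed condition amounts to something like this:

```cpp
// Illustrative sketch: fusion is only attempted for the pure GEMV case, i.e. a
// single destination column, as used during token generation.
static bool should_fuse(const int ncols_dst, const bool has_bias, const bool has_glu) {
    return ncols_dst == 1 && (has_bias || has_glu);
}
```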

Inline review on the following hunk:

```cpp
const int blocks_per_row_x = ncols_x / qk;
constexpr int blocks_per_iter = vdr * nwarps*warp_size / qi;

// The MUL_MAT_ID code path with ids != nullptr is only implemented for ncols_dst == 1.
```
Reviewer (Collaborator): Is there a reason why you're removing comments such as this one?

am17an (Collaborator, Author): I asked codex to do this and it seems to have removed comments, possibly to compress its context window. I added them back.

am17an force-pushed the cuda_fuse_gate_bias branch from 6614a9b to 65a098f on October 25, 2025 04:49
JohannesGaessler (Collaborator) commented Oct 26, 2025

When I tested performance:

| GPU | Model | Microbatch size | Test | t/s 5cca254 | t/s 65a098f | Speedup |
|---|---|---|---|---|---|---|
| MI50 | gpt-oss 20B MXFP4 MoE | 512 | tg128 | 102.93 | 119.86 | 1.16 |
| MI50 | llama 1B BF16 | 512 | tg128 | 153.86 | 148.99 | 0.97 |
| MI50 | llama 1B F16 | 512 | tg128 | 152.85 | 149.61 | 0.98 |
| MI50 | llama 1B Q4_0 | 512 | tg128 | 298.27 | 326.99 | 1.10 |
| MI50 | llama 1B all F32 | 512 | tg128 | 94.44 | 97.73 | 1.03 |
| MI50 | llama 8B Q4_0 | 512 | tg128 | 84.99 | 91.74 | 1.08 |
| P40 | gemma3 4B Q4_0 | 512 | tg128 | 73.25 | 74.34 | 1.01 |
| P40 | gpt-oss 20B MXFP4 MoE | 512 | tg128 | 72.42 | 62.37 | 0.86 |
| P40 | llama 1B BF16 | 512 | tg128 | 110.78 | 110.91 | 1.00 |
| P40 | llama 1B F16 | 512 | tg128 | 110.45 | 109.99 | 1.00 |
| P40 | llama 1B Q4_0 | 512 | tg128 | 217.79 | 229.01 | 1.05 |
| P40 | llama 1B all F32 | 512 | tg128 | 59.47 | 59.61 | 1.00 |
| P40 | llama 8B F16 | 512 | tg128 | 19.70 | 19.71 | 1.00 |
| P40 | llama 8B Q4_0 | 512 | tg128 | 54.59 | 51.99 | 0.95 |
| P40 | qwen3 0.6B Q4_0 | 512 | tg128 | 207.40 | 215.85 | 1.04 |
| P40 | qwen3moe 30B.A3B Q4_0 | 512 | tg128 | 64.97 | 66.54 | 1.02 |
| RX 6800 | gpt-oss 20B MXFP4 MoE | 512 | tg128 | 82.41 | 90.99 | 1.10 |
| RX 6800 | llama 1B BF16 | 512 | tg128 | 97.26 | 104.51 | 1.07 |
| RX 6800 | llama 1B F16 | 512 | tg128 | 97.41 | 104.21 | 1.07 |
| RX 6800 | llama 1B Q4_0 | 512 | tg128 | 218.19 | 236.98 | 1.09 |
| RX 6800 | llama 1B all F32 | 512 | tg128 | 79.92 | 82.37 | 1.03 |
| RX 6800 | llama 8B Q4_0 | 512 | tg128 | 67.14 | 70.79 | 1.05 |
| RX 9060 XT | gpt-oss 20B MXFP4 MoE | 512 | tg128 | 73.86 | 80.03 | 1.08 |
| RX 9060 XT | llama 1B BF16 | 512 | tg128 | 89.00 | 94.90 | 1.07 |
| RX 9060 XT | llama 1B F16 | 512 | tg128 | 90.06 | 94.72 | 1.05 |
| RX 9060 XT | llama 1B Q4_0 | 512 | tg128 | 183.53 | 195.18 | 1.06 |
| RX 9060 XT | llama 1B all F32 | 512 | tg128 | 57.42 | 58.57 | 1.02 |
| RX 9060 XT | llama 8B Q4_0 | 512 | tg128 | 52.11 | 54.06 | 1.04 |
| RTX 3090 | gpt-oss 20B MXFP4 MoE | 512 | tg128 | 187.76 | 191.65 | 1.02 |
| RTX 3090 | llama 1B BF16 | 512 | tg128 | 273.53 | 276.84 | 1.01 |
| RTX 3090 | llama 1B F16 | 512 | tg128 | 273.68 | 277.20 | 1.01 |
| RTX 3090 | llama 1B Q4_0 | 512 | tg128 | 526.23 | 561.10 | 1.07 |
| RTX 3090 | llama 1B all F32 | 512 | tg128 | 153.19 | 155.44 | 1.01 |
| RTX 3090 | llama 8B Q4_0 | 512 | tg128 | 142.84 | 146.64 | 1.03 |
| RTX 4090 | gpt-oss 20B MXFP4 MoE | 512 | tg128 | 232.12 | 245.32 | 1.06 |
| RTX 4090 | llama 1B BF16 | 512 | tg128 | 317.22 | 324.40 | 1.02 |
| RTX 4090 | llama 1B F16 | 512 | tg128 | 317.76 | 325.11 | 1.02 |
| RTX 4090 | llama 1B Q4_0 | 512 | tg128 | 690.64 | 723.50 | 1.05 |
| RTX 4090 | llama 1B all F32 | 512 | tg128 | 174.58 | 176.87 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 512 | tg128 | 170.52 | 175.33 | 1.03 |

On the P40 the fused MMVQ kernel does not seem to be consistently faster, so I would suggest enabling fusion of that kernel only for Volta and newer.
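A possible shape for such a guard (illustrative sketch with assumed names, not the actual ggml code):

```cpp
// Illustrative only: restrict the fused MMVQ path to Volta (compute capability
// 7.0) and newer, so Pascal cards such as the P40 (cc 6.1) keep the unfused kernel.
static bool mmvq_fusion_supported(const int compute_capability) {
    return compute_capability >= 700;
}
```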

am17an (Collaborator, Author) commented Oct 26, 2025

Thanks for testing!

am17an merged commit f77c13b into ggml-org:master on Oct 26, 2025 (72 checks passed).
am17an deleted the cuda_fuse_gate_bias branch on October 26, 2025 11:28.