graph : avoid huge warm-up graphs for MoE models #14753
Conversation
Force-pushed from 4feb0bf to 4c1bacb
Force-pushed from 4c1bacb to 033b306
src/llama-context.cpp (outdated)

```diff
@@ -1312,7 +1312,7 @@ uint32_t llama_context::output_reserve(int32_t n_outputs) {
 //

 uint32_t llama_context::graph_max_nodes() const {
-    return std::max<uint32_t>(65536u, 5u*model.n_tensors());
+    return std::max<uint32_t>(1024u, 6u*model.n_tensors());
 }
```
We should probably bump this up to 8u*model.n_tensors() just to be safe.
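For concreteness, a minimal sketch of the suggested version (the 8u factor is the reviewer's suggestion above; the 1024u floor and the llama_context member come from the diff, and this is not necessarily the merged code):

```cpp
// Sketch of the suggested node budget: 8 graph nodes per model tensor,
// keeping the small 1024-node floor for tiny models.
uint32_t llama_context::graph_max_nodes() const {
    return std::max<uint32_t>(1024u, 8u*model.n_tensors());
}
```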
If I understand correctly, the motivation of this change was to ensure that all weights are loaded into memory when using mmap on a NUMA system. This would effectively revert #11571.
I think the expert weights are still being read during warm-up by the expert matrix multiplications (see lines 867 to 870 in 033b306).

The change only removes the summation nodes that sum together the results obtained for each expert. Those do not involve reading data from the model, but they contribute a large number of graph nodes. For reference, see lines 512 to 514 in 033b306.
Edit: fixed wording at the start for clarity
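Roughly, the mechanism looks like the sketch below. This is an illustration of the idea, not the actual llama.cpp code; ctx0, down_exps, par, selected_experts, n_embd, n_tokens, n_expert_used and the warmup flag are assumed to come from the surrounding graph builder.

```cpp
// The per-expert matrix multiplication is what actually reads the expert
// weights from the model; the aggregation loop below only adds
// GGML_OP_VIEW/GGML_OP_ADD nodes on top of already-computed results.
ggml_tensor * experts = ggml_mul_mat_id(ctx0, down_exps, par, selected_experts);

ggml_tensor * moe_out = nullptr;
if (warmup) {
    // warm-up graph: the matmul above already touched all expert data,
    // so the summation nodes can be skipped entirely
    moe_out = experts;
} else {
    // normal graph: sum the per-expert results, adding extra nodes per expert
    for (int i = 0; i < n_expert_used; ++i) {
        ggml_tensor * cur_expert = ggml_view_2d(ctx0, experts, n_embd, n_tokens,
                experts->nb[2], i*experts->nb[1]);
        moe_out = (i == 0) ? cur_expert : ggml_add(ctx0, moe_out, cur_expert);
    }
}
```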
fix regression during finetune on Llama-3.2-1B-F32: GGML_ASSERT(cgraph->n_nodes < cgraph->size) failed

A git bisect, applying the most recent finetune (SGD) change, showed that

    d498af3 Georgi Gerganov 2025-07-18 14:31:15 +0300  graph : avoid huge warm-up graphs for MoE models (ggml-org#14753)

which greatly decreased graph_max_nodes, has been responsible for finetune failing on reasonably sized models for the past two months. This partially reverts the decrease (larger models may still fail).

Note: env LLAMA_SET_ROWS=0 is also needed, or else

    GGML_ASSERT(!node->view_src || node->op == GGML_OP_CPY || node->op == GGML_OP_VIEW || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_TRANSPOSE) failed

(the node->op in question is indeed a rows op). Unfortunately a git revert of

    8a4280c Georgi Gerganov 2025-08-28 12:27:02 +0300  kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505)

is not straightforward, so this branch is behind that commit.
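For context, the effect of the decreased budget can be seen with a small standalone calculation. The tensor count below is an assumed ballpark figure for a small dense model, not a value measured from Llama-3.2-1B; only the two max() formulas come from the diff in this PR.

```cpp
// Standalone sketch: compare the old and new node budgets for a model with a
// given tensor count. The tensor count is an assumption for illustration only.
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t n_tensors = 147; // assumed rough size of a small dense model

    const uint32_t budget_old = std::max<uint32_t>(65536u, 5u*n_tensors);
    const uint32_t budget_new = std::max<uint32_t>(1024u,  6u*n_tensors);

    std::printf("old budget: %u nodes\n", budget_old); // 65536
    std::printf("new budget: %u nodes\n", budget_new); // 1024 (the floor)

    // A training graph also carries backward-pass nodes, so it needs far more
    // nodes than an inference graph of the same model - which is how a much
    // tighter budget can trip GGML_ASSERT(cgraph->n_nodes < cgraph->size).
    return 0;
}
```

For a model with only a few hundred tensors the budget thus drops from 65536 to the 1024 floor, consistent with the assertion failure reported above.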
Just hot-loading the experts for the matrix multiplications is enough to heat up the caches. No need to add extra GGML_OP_ADD nodes for aggregating the results.