graph : avoid huge warm-up graphs for MoE models #14753
Conversation
Force-pushed from 4feb0bf to 4c1bacb
Force-pushed from 4c1bacb to 033b306
src/llama-context.cpp (outdated)

```diff
@@ -1312,7 +1312,7 @@ uint32_t llama_context::output_reserve(int32_t n_outputs) {
 //

 uint32_t llama_context::graph_max_nodes() const {
-    return std::max<uint32_t>(65536u, 5u*model.n_tensors());
+    return std::max<uint32_t>(1024u, 6u*model.n_tensors());
 }
```
We should probably bump this up to 8u*model.n_tensors() just to be safe.
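For concreteness, a minimal sketch of the suggested version (the 8u factor is the reviewer's suggestion above; the 1024u floor and the llama_context member come from the diff, and this is not necessarily the merged code):

```cpp
// Sketch of the suggested node budget: 8 graph nodes per model tensor,
// keeping the small 1024-node floor for tiny models.
uint32_t llama_context::graph_max_nodes() const {
    return std::max<uint32_t>(1024u, 8u*model.n_tensors());
}
```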
If I understand correctly, the motivation of this change was to ensure that all weights are loaded into memory when using mmap on a NUMA system. This would effectively revert #11571.
I think the expert weights are still being read during warm-up by the expert matrix multiplications (see lines 867 to 870 in 033b306).

The change only removes the summation nodes that sum together the results obtained for each expert. Those do not involve reading data from the model, but they contribute a large number of graph nodes. For reference, see lines 512 to 514 in 033b306.
Edit: fixed wording at the start for clarity
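Roughly, the mechanism looks like the sketch below. This is an illustration of the idea, not the actual llama.cpp code; ctx0, down_exps, par, selected_experts, n_embd, n_tokens, n_expert_used and the warmup flag are assumed to come from the surrounding graph builder.

```cpp
// The per-expert matrix multiplication is what actually reads the expert
// weights from the model; the aggregation loop below only adds
// GGML_OP_VIEW/GGML_OP_ADD nodes on top of already-computed results.
ggml_tensor * experts = ggml_mul_mat_id(ctx0, down_exps, par, selected_experts);

ggml_tensor * moe_out = nullptr;
if (warmup) {
    // warm-up graph: the matmul above already touched all expert data,
    // so the summation nodes can be skipped entirely
    moe_out = experts;
} else {
    // normal graph: sum the per-expert results, adding extra nodes per expert
    for (int i = 0; i < n_expert_used; ++i) {
        ggml_tensor * cur_expert = ggml_view_2d(ctx0, experts, n_embd, n_tokens,
                experts->nb[2], i*experts->nb[1]);
        moe_out = (i == 0) ? cur_expert : ggml_add(ctx0, moe_out, cur_expert);
    }
}
```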
fix regression during finetune on Llama-3.2-1B-F32: GGML_ASSERT(cgraph->n_nodes < cgraph->size) failed

A git bisect, applying the most recent finetune (SGD) change, showed that

    d498af3 Georgi Gerganov 2025-07-18 14:31:15 +0300  graph : avoid huge warm-up graphs for MoE models (ggml-org#14753)

which greatly decreased graph_max_nodes, has been responsible for finetune failing on reasonably sized models for the past two months. This partially reverts the decrease (larger models may still fail).

Note: env LLAMA_SET_ROWS=0 is also needed, or else

    GGML_ASSERT(!node->view_src || node->op == GGML_OP_CPY || node->op == GGML_OP_VIEW || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_TRANSPOSE) failed

(the node->op in question is indeed a rows op). Unfortunately a git revert of

    8a4280c Georgi Gerganov 2025-08-28 12:27:02 +0300  kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505)

is not straightforward, so this branch is behind that commit.
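For context, the effect of the decreased budget can be seen with a small standalone calculation. The tensor count below is an assumed ballpark figure for a small dense model, not a value measured from Llama-3.2-1B; only the two max() formulas come from the diff in this PR.

```cpp
// Standalone sketch: compare the old and new node budgets for a model with a
// given tensor count. The tensor count is an assumption for illustration only.
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t n_tensors = 147; // assumed rough size of a small dense model

    const uint32_t budget_old = std::max<uint32_t>(65536u, 5u*n_tensors);
    const uint32_t budget_new = std::max<uint32_t>(1024u,  6u*n_tensors);

    std::printf("old budget: %u nodes\n", budget_old); // 65536
    std::printf("new budget: %u nodes\n", budget_new); // 1024 (the floor)

    // A training graph also carries backward-pass nodes, so it needs far more
    // nodes than an inference graph of the same model - which is how a much
    // tighter budget can trip GGML_ASSERT(cgraph->n_nodes < cgraph->size).
    return 0;
}
```

For a model with only a few hundred tensors the budget thus drops from 65536 to the 1024 floor, consistent with the assertion failure reported above.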
Just hot-loading the experts for the matrix multiplications is enough to heat up the caches. No need to add extra GGML_OP_ADD nodes for aggregating the results.