
Conversation

@ggerganov (Member)

While working on #11213 I realized that we are currently doing many unnecessary graph defrags because of incorrect cache-fragmentation logic: the cache padding triggers the fragmentation threshold for small contexts even when there is no fragmentation at all.
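A minimal sketch of the problem, using hypothetical numbers (the real padding value depends on the backend and on whether Flash Attention is enabled):

```cpp
// Illustration only: why the old estimate over-reports fragmentation for
// small contexts. All numbers below are hypothetical.
#include <cstdio>

int main() {
    const int padding = 256;  // assumed cache padding
    const int used    = 32;   // cells that actually hold tokens

    // the cache size n is rounded up to a multiple of the padding
    const int n = ((used + padding - 1) / padding) * padding;  // -> 256

    // old estimate: every padded-but-empty cell counts as a "hole"
    const float fragmentation = n >= 128 ? 1.0f - float(used) / float(n) : 0.0f;

    // prints ~0.88 even though the cache has no holes at all, so any
    // defrag_thold below that would trigger a pointless defrag
    printf("fragmentation = %.2f\n", fragmentation);
    return 0;
}
```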

```sh
./scripts/compare-commits.sh master gg/llama-fix-defrag -m models/llama-3.1-8b-instruct/ggml-model-q4_0.gguf -m models/llama-3.1-8b-instruct/ggml-model-q8_0.gguf -m models/llama-3.1-8b-instruct/ggml-model-f16.gguf -m models/qwen2.5-3b-coder/ggml-model-q4_0.gguf -m models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-3b-coder/ggml-model-f16.gguf -fa 1
```
| Model | Test | t/s master | t/s gg/llama-fix-defrag | Speedup |
| --- | --- | ---: | ---: | ---: |
| llama 8B F16 | pp512 | 1458.51 | 1458.18 | 1.00 |
| llama 8B F16 | tg128 | 38.82 | 39.19 | 1.01 |
| llama 8B Q4_0 | pp512 | 1324.28 | 1323.85 | 1.00 |
| llama 8B Q4_0 | tg128 | 99.55 | 101.37 | 1.02 |
| llama 8B Q8_0 | pp512 | 1298.42 | 1298.34 | 1.00 |
| llama 8B Q8_0 | tg128 | 66.23 | 66.99 | 1.01 |
| qwen2 3B F16 | pp512 | 3226.49 | 3226.91 | 1.00 |
| qwen2 3B F16 | tg128 | 71.26 | 72.44 | 1.02 |
| qwen2 3B Q4_0 | pp512 | 2927.50 | 2925.14 | 1.00 |
| qwen2 3B Q4_0 | tg128 | 138.02 | 142.55 | 1.03 |
| qwen2 3B Q8_0 | pp512 | 2880.21 | 2878.93 | 1.00 |
| qwen2 3B Q8_0 | tg128 | 108.89 | 112.35 | 1.03 |

master has the following patch applied:

```diff
diff --git a/examples/llama-bench/llama-bench.cpp b/examples/llama-bench/llama-bench.cpp
index 4ac19ca86..8e9f90f27 100644
--- a/examples/llama-bench/llama-bench.cpp
+++ b/examples/llama-bench/llama-bench.cpp
@@ -753,6 +753,7 @@ struct cmd_params_instance {
         cparams.offload_kqv = !no_kv_offload;
         cparams.flash_attn  = flash_attn;
         cparams.embeddings  = embeddings;
+        cparams.defrag_thold = 0.1f;

         return cparams;
     }
```

ggerganov merged commit ed926d8 into master on Feb 7, 2025
50 of 53 checks passed
ggerganov deleted the gg/llama-fix-defrag branch on February 7, 2025 at 14:05
The review comment below was left on this hunk of the defrag check:

```diff
-        if (cparams.causal_attn && cparams.defrag_thold >= 0.0f) {
-            const float fragmentation = kv_self.n >= 128 ? 1.0f - float(kv_self.used)/float(kv_self.n) : 0.0f;
+        if (cparams.causal_attn && cparams.defrag_thold > 0.0f) {
+            // - do not defrag small contexts (i.e. < 2048 tokens)
```
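Read together with the commit messages further down ("do not defrag small contexts", "clamp fragmentation to 0.0"), the revised check roughly amounts to the following sketch. It is not the merged code verbatim: the function name, variable names, padding accounting, and the exact minimum-context value are assumptions here.

```cpp
// Sketch only: approximate shape of the fixed defrag trigger.
#include <algorithm>

bool should_defrag(bool causal_attn, float defrag_thold,
                   int n_cells, int n_used, int n_padding /* hypothetical */) {
    // threshold must be strictly positive (defrag_thold <= 0 disables defrag)
    if (!causal_attn || defrag_thold <= 0.0f) {
        return false;
    }
    // do not defrag small contexts (i.e. < 2048 tokens)
    if (n_cells < 2048) {
        return false;
    }
    // count the padding towards the used cells and clamp to 0.0 so that
    // padding alone can never push the estimate above the threshold
    const float fragmentation =
        std::max(0.0f, 1.0f - float(n_used + n_padding) / float(n_cells));
    return fragmentation > defrag_thold;
}
```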
@MoonRide303 (Contributor), Feb 8, 2025

@ggerganov I am sometimes running benchmarks that require only 256 or 512 tokens per slot, with total context size like 512 or 1024 (for big models that don't fully fit into my VRAM). Will it work properly in cases like that?

@ggerganov (Member, Author)

Defragmentation for such a small context is not really worth it, so my expectation is that with this change you should get better performance overall.
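To put hypothetical numbers on that (a 512-cell slot, 256-cell padding, 300 cells in use; none of these values are measurements from the benchmark above):

```cpp
// Hypothetical before/after comparison for a small 512-cell cache.
#include <algorithm>
#include <cstdio>

int main() {
    const int n = 512, used = 300, padding = 256;

    // old estimate: ~0.41, which exceeds e.g. defrag_thold = 0.1 and defrags
    const float before = 1.0f - float(used) / float(n);

    // new estimate (sketch): caches below the minimum size never defrag
    const float after = n >= 2048
        ? std::max(0.0f, 1.0f - float(used + padding) / float(n))
        : 0.0f;

    printf("before = %.2f, after = %.2f\n", before, after);
    return 0;
}
```

Under this sketch a 512- or 1024-token context simply skips the defrag pass entirely, which is consistent with the answer above.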

tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
* llama : fix defrag logic

ggml-ci

* cont : better logic

ggml-ci

* cont : clamp fragmentation to 0.0

ggml-ci
orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025

mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025