
Conversation

@ggerganov (Member)

While working on #11213 I realized that we are currently doing many unnecessary graph defrags because of incorrect cache-fragmentation logic: the cache padding triggers the fragmentation threshold for small contexts even when there is no fragmentation at all.
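A minimal sketch of the problem, using hypothetical numbers (the real padding value depends on the backend and on whether Flash Attention is enabled):

```cpp
// Illustration only: why the old estimate over-reports fragmentation for
// small contexts. All numbers below are hypothetical.
#include <cstdio>

int main() {
    const int padding = 256;  // assumed cache padding
    const int used    = 32;   // cells that actually hold tokens

    // the cache size n is rounded up to a multiple of the padding
    const int n = ((used + padding - 1) / padding) * padding;  // -> 256

    // old estimate: every padded-but-empty cell counts as a "hole"
    const float fragmentation = n >= 128 ? 1.0f - float(used) / float(n) : 0.0f;

    // prints ~0.88 even though the cache has no holes at all, so any
    // defrag_thold below that would trigger a pointless defrag
    printf("fragmentation = %.2f\n", fragmentation);
    return 0;
}
```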

```sh
./scripts/compare-commits.sh master gg/llama-fix-defrag -m models/llama-3.1-8b-instruct/ggml-model-q4_0.gguf -m models/llama-3.1-8b-instruct/ggml-model-q8_0.gguf -m models/llama-3.1-8b-instruct/ggml-model-f16.gguf -m models/qwen2.5-3b-coder/ggml-model-q4_0.gguf -m models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-3b-coder/ggml-model-f16.gguf -fa 1
```
| Model | Test | t/s master | t/s gg/llama-fix-defrag | Speedup |
| --- | --- | ---: | ---: | ---: |
| llama 8B F16 | pp512 | 1458.51 | 1458.18 | 1.00 |
| llama 8B F16 | tg128 | 38.82 | 39.19 | 1.01 |
| llama 8B Q4_0 | pp512 | 1324.28 | 1323.85 | 1.00 |
| llama 8B Q4_0 | tg128 | 99.55 | 101.37 | 1.02 |
| llama 8B Q8_0 | pp512 | 1298.42 | 1298.34 | 1.00 |
| llama 8B Q8_0 | tg128 | 66.23 | 66.99 | 1.01 |
| qwen2 3B F16 | pp512 | 3226.49 | 3226.91 | 1.00 |
| qwen2 3B F16 | tg128 | 71.26 | 72.44 | 1.02 |
| qwen2 3B Q4_0 | pp512 | 2927.50 | 2925.14 | 1.00 |
| qwen2 3B Q4_0 | tg128 | 138.02 | 142.55 | 1.03 |
| qwen2 3B Q8_0 | pp512 | 2880.21 | 2878.93 | 1.00 |
| qwen2 3B Q8_0 | tg128 | 108.89 | 112.35 | 1.03 |

master has the following patch applied:

```diff
diff --git a/examples/llama-bench/llama-bench.cpp b/examples/llama-bench/llama-bench.cpp
index 4ac19ca86..8e9f90f27 100644
--- a/examples/llama-bench/llama-bench.cpp
+++ b/examples/llama-bench/llama-bench.cpp
@@ -753,6 +753,7 @@ struct cmd_params_instance {
         cparams.offload_kqv = !no_kv_offload;
         cparams.flash_attn  = flash_attn;
         cparams.embeddings  = embeddings;
+        cparams.defrag_thold = 0.1f;

         return cparams;
     }
```

ggerganov merged commit ed926d8 into master on Feb 7, 2025
50 of 53 checks passed
ggerganov deleted the gg/llama-fix-defrag branch on February 7, 2025 at 14:05
The review comment below was left on this hunk of the defrag check:

```diff
-        if (cparams.causal_attn && cparams.defrag_thold >= 0.0f) {
-            const float fragmentation = kv_self.n >= 128 ? 1.0f - float(kv_self.used)/float(kv_self.n) : 0.0f;
+        if (cparams.causal_attn && cparams.defrag_thold > 0.0f) {
+            // - do not defrag small contexts (i.e. < 2048 tokens)
```
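Read together with the commit messages further down ("do not defrag small contexts", "clamp fragmentation to 0.0"), the revised check roughly amounts to the following sketch. It is not the merged code verbatim: the function name, variable names, padding accounting, and the exact minimum-context value are assumptions here.

```cpp
// Sketch only: approximate shape of the fixed defrag trigger.
#include <algorithm>

bool should_defrag(bool causal_attn, float defrag_thold,
                   int n_cells, int n_used, int n_padding /* hypothetical */) {
    // threshold must be strictly positive (defrag_thold <= 0 disables defrag)
    if (!causal_attn || defrag_thold <= 0.0f) {
        return false;
    }
    // do not defrag small contexts (i.e. < 2048 tokens)
    if (n_cells < 2048) {
        return false;
    }
    // count the padding towards the used cells and clamp to 0.0 so that
    // padding alone can never push the estimate above the threshold
    const float fragmentation =
        std::max(0.0f, 1.0f - float(n_used + n_padding) / float(n_cells));
    return fragmentation > defrag_thold;
}
```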
@MoonRide303 (Contributor), Feb 8, 2025

@ggerganov I am sometimes running benchmarks that require only 256 or 512 tokens per slot, with total context size like 512 or 1024 (for big models that don't fully fit into my VRAM). Will it work properly in cases like that?

@ggerganov (Member, Author)

Defragmentation for such a small context is not really worth it, so my expectation is that with this change you should get better performance overall.
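To put hypothetical numbers on that (a 512-cell slot, 256-cell padding, 300 cells in use; none of these values are measurements from the benchmark above):

```cpp
// Hypothetical before/after comparison for a small 512-cell cache.
#include <algorithm>
#include <cstdio>

int main() {
    const int n = 512, used = 300, padding = 256;

    // old estimate: ~0.41, which exceeds e.g. defrag_thold = 0.1 and defrags
    const float before = 1.0f - float(used) / float(n);

    // new estimate (sketch): caches below the minimum size never defrag
    const float after = n >= 2048
        ? std::max(0.0f, 1.0f - float(used + padding) / float(n))
        : 0.0f;

    printf("before = %.2f, after = %.2f\n", before, after);
    return 0;
}
```

Under this sketch a 512- or 1024-token context simply skips the defrag pass entirely, which is consistent with the answer above.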

tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
* llama : fix defrag logic

ggml-ci

* cont : better logic

ggml-ci

* cont : clamp fragmentation to 0.0

ggml-ci
orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025

mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025