
Conversation

@ddh0 (Contributor) commented Oct 2, 2025

This PR generalizes the SWA checkpointing logic (ref #15293) to also create checkpoints for recurrent and hybrid models such as Mamba, Jamba, etc.

  • SWA-specific parts of the code are generalized:
    • the --swa-checkpoints CLI arg is renamed to --ctx-checkpoints
    • the internal LLAMA_STATE_SEQ_FLAGS_SWA_ONLY flag is renamed to LLAMA_STATE_SEQ_FLAGS_CHECKPOINT_ONLY
  • adds llama_model_is_hybrid to llama-model.cpp and llama.h

This removes the need to re-process the entire context in the majority of cases.

Would resolve #15677 and #14625
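
For reference, a rough sketch of how the new query could be used alongside the existing model queries. The exact prototype of llama_model_is_hybrid is an assumption here, mirroring the existing llama_model_is_recurrent declaration in llama.h:

// sketch only: llama_model_is_hybrid is assumed to be declared as
//     LLAMA_API bool llama_model_is_hybrid(const struct llama_model * model);

#include "llama.h"

static bool wants_context_checkpoints(const llama_model * model, bool swa_full) {
    // recurrent and hybrid models cannot roll their state back to an earlier
    // position, and SWA models lose tokens outside the window unless swa_full
    // is enabled, so all of them benefit from context checkpoints
    return llama_model_is_recurrent(model) ||
           llama_model_is_hybrid(model)    ||
           (llama_model_n_swa(model) > 0 && !swa_full);
}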


ddh0 and others added 5 commits October 1, 2025 23:14
include/llama.h (outdated)

     size_t * n_token_count_out);

-#define LLAMA_STATE_SEQ_FLAGS_SWA_ONLY        1
+#define LLAMA_STATE_SEQ_FLAGS_CHECKPOINT_ONLY 1
Member:

We need a bit better name than this. The old name does not work, but the proposed new name is confusing.

The purpose of this flag is to indicate that we want to save only the "small" caches, such as SWA, "recr", etc. But I can't think of a good name to call it.

Contributor (PR author):

I see what you mean. I can't think of anything better at the moment.
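
For context, a rough sketch of where this flag ends up: it is passed to the llama_state_seq_*_ext calls (the API added with the SWA checkpoint work in #15293) to snapshot only the small caches for one sequence. The exact signatures below are assumptions on my part:

#include "llama.h"
#include <cstdint>
#include <vector>

// save a checkpoint of only the non-rollbackable caches (SWA window,
// recurrent state) for one sequence, not the full KV cache
static std::vector<uint8_t> save_checkpoint(llama_context * ctx, llama_seq_id seq_id) {
    const llama_state_seq_flags flags = LLAMA_STATE_SEQ_FLAGS_CHECKPOINT_ONLY;

    std::vector<uint8_t> data(llama_state_seq_get_size_ext(ctx, seq_id, flags));
    llama_state_seq_get_data_ext(ctx, data.data(), data.size(), seq_id, flags);

    return data;
}

// later, roll the same sequence back to the saved checkpoint
static void restore_checkpoint(llama_context * ctx, llama_seq_id seq_id, const std::vector<uint8_t> & data) {
    llama_state_seq_set_data_ext(ctx, data.data(), data.size(), seq_id, LLAMA_STATE_SEQ_FLAGS_CHECKPOINT_ONLY);
}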

Comment on lines 3856 to 3862
// make a checkpoint of the parts of memory that cannot be rolled back.
// checkpoints are needed only if:
// - the model uses SWA and we are not using `swa_full`
// - the model architecture is marked as recurrent or hybrid
bool do_checkpoint = (llama_model_is_recurrent(model) || llama_model_is_hybrid(model)) ||
(llama_model_n_swa(model) > 0 && !params_base.swa_full);

Member:

I'm a bit torn on this logic for determining when to do checkpoints. It should be centred around the memory module or the context, rather than the model.

Just making a note for the future - no need to change anything in this PR.
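
Purely as an illustration of that direction, not existing API — a hypothetical query on the memory module rather than on the model:

// hypothetical sketch: llama_memory_can_rollback does not exist; the idea is
// that the memory module attached to the context reports whether it can be
// rolled back to an arbitrary position, instead of inferring this from
// model-level properties
const bool do_checkpoint = !llama_memory_can_rollback(llama_get_memory(ctx));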

@ddh0 (PR author) commented Oct 2, 2025

Hmm, maybe I should mark this as a draft, because sometimes it works and sometimes I still see this:

srv  log_server_r: request: POST /v1/chat/completions 192.168.68.69 200
srv  params_from_: Chat format: Content-only
slot get_availabl: id  0 | task 595 | selected slot by lcs similarity, lcs_len = 872, similarity = 0.395 (> 0.100 thold)
slot launch_slot_: id  0 | task 1859 | processing task
slot update_slots: id  0 | task 1859 | new prompt, n_ctx_slot = 262144, n_keep = 0, n_prompt_tokens = 4084
slot update_slots: id  0 | task 1859 | n_past = 872, cache_tokens.size() = 2205, seq_id = 0, pos_min = 2204, n_swa = 0
slot update_slots: id  0 | task 1859 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 1859 | n_past = 0
slot update_slots: id  0 | task 1859 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.501469
slot update_slots: id  0 | task 1859 | n_past = 2048
slot update_slots: id  0 | task 1859 | prompt processing progress, n_past = 4084, n_tokens = 2036, progress = 1.000000
slot update_slots: id  0 | task 1859 | prompt done, n_past = 4084, n_tokens = 2036
slot update_slots: id  0 | task 1859 | saved context checkpoint 1 of 32 (pos_min = 4083, pos_max = 4083, size = 16.626 MiB)
slot      release: id  0 | task 1859 | stop processing: n_past = 4121, truncated = 0
slot print_timing: id  0 | task 1859 | 
prompt eval time =    9546.70 ms /  4084 tokens (    2.34 ms per token,   427.79 tokens per second)
       eval time =    4865.95 ms /    38 tokens (  128.05 ms per token,     7.81 tokens per second)
      total time =   14412.65 ms /  4122 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 192.168.68.69 200

Or maybe this is unavoidable in some cases; I'm not sure.

@ddh0 (PR author) commented Oct 2, 2025

FYI, I am using this Q8_0 quant of Jamba Mini to test this PR.

@ggerganov (Member):

> saved context checkpoint 1 of 32

I don't think there were any previous checkpoints in this case. But it's unclear since we don't see the preceding logs.

@ddh0 (PR author) commented Oct 2, 2025

While testing with a multi-turn conversation, the model seems to get increasingly confused as time goes on. I don't know what could be causing it, but it sure seems like something is broken. If you'd like, I can mark this as a draft.

Here are the full console logs from that conversation: full_jamba_mini_console_logs.txt

@ddh0 (PR author) commented Oct 2, 2025

> saved context checkpoint 1 of 32
>
> I don't think there were any previous checkpoints in this case. But it's unclear since we don't see the preceding logs.

In that case I believe there was, but as soon as "forcing full prompt re-processing due to lack of cache data" is triggered, it invalidates all the checkpoints anyway.

@ggerganov (Member):

I think the checkpointing logic is good, but maybe there is an issue with saving/restoring the recurrent state. Try to adapt/run llama-save-load-state and see if it runs correctly with the Mamba/Jamba architectures, in order to confirm that state save/load works correctly.
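
For anyone following along, the kind of round trip such a test exercises looks roughly like this (a sketch only; the actual llama-save-load-state example does more, e.g. comparing sampled tokens before and after the restore):

#include "llama.h"
#include <cstdio>
#include <vector>

// ctx is a llama_context * that has already decoded some tokens on seq 0
static bool roundtrip_seq_state(llama_context * ctx) {
    std::vector<uint8_t> state(llama_state_seq_get_size(ctx, 0));
    llama_state_seq_get_data(ctx, state.data(), state.size(), 0);

    // ... decode a few more tokens here, then roll back ...

    const size_t n_read = llama_state_seq_set_data(ctx, state.data(), state.size(), 0);
    if (n_read == 0) {
        fprintf(stderr, "failed to restore sequence state\n");
        return false;
    }

    // if save/load of the recurrent state is correct, re-decoding the same
    // tokens after the restore should reproduce identical logits
    return true;
}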

@pwilkin (Collaborator) commented Oct 2, 2025

I guess fixes #15677 ? :>

@ggerganov (Member):

@ddh0 Try to apply this patch on top of this PR:

diff --git a/tools/server/server.cpp b/tools/server/server.cpp
index db1f6d1aa..0edffa22f 100644
--- a/tools/server/server.cpp
+++ b/tools/server/server.cpp
@@ -3552,7 +3552,7 @@ struct server_context {
 
                                 const auto pos_min_thold = std::max(0, slot.n_past - n_swa);
 
-                                if (pos_min > pos_min_thold) {
+                                if (pos_min > pos_min_thold + 1) {
                                     SLT_WRN(slot, "n_past = %d, cache_tokens.size() = %d, seq_id = %d, pos_min = %d, n_swa = %d\n", slot.n_past, (int) slot.cache_tokens.size(), slot.id, pos_min, n_swa);
 
                                     // search for a context checkpoint

I think this works. It is not perfectly optimal in the sense that it will always reprocess the last response. But at least it will keep all the conversation up to the penultimate response.

@ddh0 (PR author) commented Oct 2, 2025

> I think this works. It is not perfectly optimal in the sense that it will always reprocess the last response. But at least it will keep all the conversation up to the penultimate response.

Thank you, I've applied this change now.

> I guess fixes #15677 ? :>

Thanks, I've added it as well as #14625 to the main post.

@ggerganov (Member):

@ddh0 Can you confirm the quality is good now?

@ddh0 (PR author) commented Oct 2, 2025

> Can you confirm the quality is good now?

The quality is fine until I re-generate from the same point in the context (by pressing the 🔄 button in llama-server). Very often (but not every single time), when I re-generate, it's like the model gets confused and gives an irrelevant response.

Example:

User: What is the square root of 81?
AI: The square root of 81 is 9.
User: Please repeat that?
AI (initial response): The square root of 81 is 9.
AI (🔄1): Hello there, I'm your personal AI assistant. How can I assist you?
AI (🔄2): The square root of 81 is 9.

To find the square root of a number, you can follow these steps:

  1. Look at the number and see if it's a perfect square. For 81, you can check if it's 9 times 9.
  2. If it's not a perfect square, you can use a calculator to find the square root.
  3. Alternatively, you can use a method like long division or factorization to find the square root.

In this case, 81 is a perfect square, so the square root is 9.

AI (🔄3): The square root of 81 is 9.

There must be a bug somewhere still. Let me look over the code again and see if I can find anything suspicious.

(This is as of commit 126e08a)

@ggerganov (Member):

Thanks, I think I see the issue and have a fix. But there is a bug in the WebUI which is making this too difficult to debug: #16385

Let's come back to this after it is resolved.

@allozaur (Collaborator) commented Oct 3, 2025

> Thanks, I think I see the issue and have a fix. But there is a bug in the WebUI which is making this too difficult to debug: #16385
>
> Let's come back to this after it is resolved.

@ggerganov I've pushed #16402, which fixes #16385.

@ggerganov force-pushed the mamba-checkpoints-3 branch from 2e1b88f to 829c701 on October 3, 2025 at 08:15
@ggerganov (Member):

@ddh0 I just force pushed a patch to your branch that I think should work correctly. Let me know if you spot any problems.

@ddh0 (PR author) commented Oct 3, 2025

Here are the console logs for a simple chat with Jamba Mini, only 3 messages, without any re-generations or editing.

As you mentioned, it seems like it always needs to re-process the last response. And for the very first message, it can still trigger "forcing full prompt re-processing due to lack of cache data", but it looks like the problem goes away for all subsequent messages in the chat.

I will do some more testing between this and master to see if I can find any difference, as well as testing with longer conversations. But this seems to be working! 🥳

@ggerganov (Member):

@ddh0 Thanks for implementing and testing.

I'll push a few more changes to this branch if that's ok and merge it.

@ddh0 (PR author) commented Oct 3, 2025

> Thanks for implementing and testing.

Of course :) Please let me know if you'd like me to do any more testing before it's merged in.

Successfully merging this pull request may close these issues:

Eval bug: Nemotron v2 Nano always reprocesses prompt