kv-cache : fix SWA checks + disable cacheless iSWA #15811
Merged
Conversation
ggerganov commented on Sep 5, 2025:
```cpp
// note that this function uses different SWA parameters from those in the hparams
// TODO: think of a better place for this function
// TODO: pack the SWA params in a struct?
static bool is_masked_swa(uint32_t n_swa, llama_swa_type swa_type, llama_pos p0, llama_pos p1);
```
Changed this to a static function.
Maybe it should become a member like this:
Suggested change:

```diff
-static bool is_masked_swa(uint32_t n_swa, llama_swa_type swa_type, llama_pos p0, llama_pos p1);
+bool is_masked_swa(uint32_t il, llama_pos p0, llama_pos p1) const;
```
But let's refactor this after master has stabilized.
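For reference, a minimal sketch of what the standard sliding-window case of this check could look like — the masking rule here is an assumption based on the usual SWA definition, not copied from the implementation, and other window types (e.g. chunked) are omitted:

```cpp
// Sketch only: plausible semantics for the static helper above.
// A token at position p1 may attend to position p0 only if p0 lies
// within the sliding window of size n_swa.
static bool is_masked_swa(uint32_t n_swa, llama_swa_type swa_type, llama_pos p0, llama_pos p1) {
    switch (swa_type) {
        case LLAMA_SWA_TYPE_STANDARD:
            // everything further back than n_swa positions is masked out
            return p1 - p0 >= (llama_pos) n_swa;
        case LLAMA_SWA_TYPE_NONE:
        default:
            return false; // no sliding-window masking for this layer
    }
}
```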
danbev approved these changes on Sep 5, 2025.
Merging to fix regular SWA models such as …
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request on Sep 5, 2025:
…g-model-disabled-agent-prefill

* origin/master: (84 commits)
  CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
  tests : add --list-ops and --show-coverage options (ggml-org#15745)
  gguf: gguf_writer refactor (ggml-org#15691)
  kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
  model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
  chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
  chat : nemotron thinking & toolcalling support (ggml-org#15676)
  scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
  llama : add support for EmbeddingGemma 300m (ggml-org#15798)
  metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
  llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
  CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
  server: add exceed_context_size_error type (ggml-org#15780)
  Document the new max GPU layers default in help (ggml-org#15771)
  ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
  CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
  opencl: add hs=40 to FA (ggml-org#15758)
  CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
  vulkan: fix mmv subgroup16 selection (ggml-org#15775)
  vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
  ...
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request on Sep 5, 2025:
…upport

* origin/master:
  Thinking model disabled assistant prefill (ggml-org#15404)
  Implement --log-colors with always/never/auto (ggml-org#15792)
  CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
  tests : add --list-ops and --show-coverage options (ggml-org#15745)
  gguf: gguf_writer refactor (ggml-org#15691)
  kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
  model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
  chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
  chat : nemotron thinking & toolcalling support (ggml-org#15676)
  scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
  llama : add support for EmbeddingGemma 300m (ggml-org#15798)
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request on Sep 7, 2025.
cont #15798
fix #15808

Support for iSWA models without constructing a KV cache would need a bit more work, since the existing `llm_graph_input_attn_no_cache` assumes a single KQ mask, while supporting iSWA requires two masks: one for the SWA layers and one for the non-SWA layers.
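A hypothetical sketch of that direction — the struct and member names beyond `kq_mask` are assumptions for illustration, not existing llama.cpp API:

```cpp
// Hypothetical: a cacheless attention input that carries two KQ masks,
// so iSWA models can mask SWA and non-SWA layers differently.
struct llm_graph_input_attn_no_cache_iswa {
    ggml_tensor * kq_mask;     // mask for the non-SWA layers
    ggml_tensor * kq_mask_swa; // mask for the SWA layers

    // pick the mask for layer il based on whether it uses a sliding window
    ggml_tensor * get_kq_mask(const llama_hparams & hparams, uint32_t il) const {
        return hparams.is_swa(il) ? kq_mask_swa : kq_mask;
    }
};
```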
Also fix a regression for iSWA models introduced in #15798: when we mask the attention, we should not use `hparams.swa_type` for all layers - only for the SWA layers. This was handled by the KV cache, which is why it had its own `swa_type` to differentiate it from the one in `hparams`.
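The per-layer selection the fix implies could look roughly like this — a sketch, where `p0`/`p1` and the surrounding loop stand in for the actual mask-building code:

```cpp
// Sketch: only SWA layers apply hparams.swa_type; non-SWA layers are
// treated as if there is no sliding window at all.
for (uint32_t il = 0; il < hparams.n_layer; ++il) {
    const llama_swa_type swa_type = hparams.is_swa(il) ? hparams.swa_type : LLAMA_SWA_TYPE_NONE;

    const bool masked = is_masked_swa(hparams.n_swa, swa_type, p0, p1);
    // ... write `masked` into the KQ mask entry for layer il at (p0, p1) ...
}
```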