kv-cache : fix SWA checks + disable cacheless iSWA #15811
Merged
Conversation
ggerganov commented on Sep 5, 2025:
```cpp
// note that this function uses different SWA parameters from those in the hparams
// TODO: think of a better place for this function
// TODO: pack the SWA params in a struct?
static bool is_masked_swa(uint32_t n_swa, llama_swa_type swa_type, llama_pos p0, llama_pos p1);
```
Changed this to a static function.
Maybe it should become a member like this:
Suggested change:

```diff
-static bool is_masked_swa(uint32_t n_swa, llama_swa_type swa_type, llama_pos p0, llama_pos p1);
+bool is_masked_swa(uint32_t il, llama_pos p0, llama_pos p1) const;
```
But let's refactor this after master has stabilized.
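For reference, a minimal sketch of what the standard sliding-window case of this check could look like — the masking rule here is an assumption based on the usual SWA definition, not copied from the implementation, and other window types (e.g. chunked) are omitted:

```cpp
// Sketch only: plausible semantics for the static helper above.
// A token at position p1 may attend to position p0 only if p0 lies
// within the sliding window of size n_swa.
static bool is_masked_swa(uint32_t n_swa, llama_swa_type swa_type, llama_pos p0, llama_pos p1) {
    switch (swa_type) {
        case LLAMA_SWA_TYPE_STANDARD:
            // everything further back than n_swa positions is masked out
            return p1 - p0 >= (llama_pos) n_swa;
        case LLAMA_SWA_TYPE_NONE:
        default:
            return false; // no sliding-window masking for this layer
    }
}
```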
danbev approved these changes on Sep 5, 2025.
Merging to fix regular SWA models such as …
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request on Sep 5, 2025:
…g-model-disabled-agent-prefill

* origin/master: (84 commits)
  CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
  tests : add --list-ops and --show-coverage options (ggml-org#15745)
  gguf: gguf_writer refactor (ggml-org#15691)
  kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
  model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
  chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
  chat : nemotron thinking & toolcalling support (ggml-org#15676)
  scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
  llama : add support for EmbeddingGemma 300m (ggml-org#15798)
  metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
  llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
  CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
  server: add exceed_context_size_error type (ggml-org#15780)
  Document the new max GPU layers default in help (ggml-org#15771)
  ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
  CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
  opencl: add hs=40 to FA (ggml-org#15758)
  CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
  vulkan: fix mmv subgroup16 selection (ggml-org#15775)
  vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
  ...
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request on Sep 5, 2025:
…upport

* origin/master:
  Thinking model disabled assistant prefill (ggml-org#15404)
  Implement --log-colors with always/never/auto (ggml-org#15792)
  CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
  tests : add --list-ops and --show-coverage options (ggml-org#15745)
  gguf: gguf_writer refactor (ggml-org#15691)
  kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
  model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
  chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
  chat : nemotron thinking & toolcalling support (ggml-org#15676)
  scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
  llama : add support for EmbeddingGemma 300m (ggml-org#15798)
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request on Sep 7, 2025.
cont #15798
fix #15808

Support for iSWA models without constructing a KV cache would need a bit more work, since the existing `llm_graph_input_attn_no_cache` assumes a single KQ mask, while supporting iSWA requires two masks: one for the SWA layers and one for the non-SWA layers.
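A hypothetical sketch of that direction — the struct and member names beyond `kq_mask` are assumptions for illustration, not existing llama.cpp API:

```cpp
// Hypothetical: a cacheless attention input that carries two KQ masks,
// so iSWA models can mask SWA and non-SWA layers differently.
struct llm_graph_input_attn_no_cache_iswa {
    ggml_tensor * kq_mask;     // mask for the non-SWA layers
    ggml_tensor * kq_mask_swa; // mask for the SWA layers

    // pick the mask for layer il based on whether it uses a sliding window
    ggml_tensor * get_kq_mask(const llama_hparams & hparams, uint32_t il) const {
        return hparams.is_swa(il) ? kq_mask_swa : kq_mask;
    }
};
```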
Also fix a regression for iSWA models introduced in #15798: when we mask the attention, we should not use `hparams.swa_type` for all layers - only for the SWA layers. This was handled by the KV cache, which is why it had its own `swa_type` to differentiate it from the one in `hparams`.
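The per-layer selection the fix implies could look roughly like this — a sketch, where `p0`/`p1` and the surrounding loop stand in for the actual mask-building code:

```cpp
// Sketch: only SWA layers apply hparams.swa_type; non-SWA layers are
// treated as if there is no sliding window at all.
for (uint32_t il = 0; il < hparams.n_layer; ++il) {
    const llama_swa_type swa_type = hparams.is_swa(il) ? hparams.swa_type : LLAMA_SWA_TYPE_NONE;

    const bool masked = is_masked_swa(hparams.n_swa, swa_type, p0, p1);
    // ... write `masked` into the KQ mask entry for layer il at (p0, p1) ...
}
```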