
Misc. bug: regression: using cache-reuse slows down subsequent prompts (chat sessions) #17065


Description

@daitj

Name and Version

$./llama-cli --version
version: b6927 (6b9a524)
built with cc (Debian 15.2.0-7) 15.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server --port 8383 --host 0.0.0.0 -m qwen3_30B-A3B_Q6_K.gguf --cache-reuse 256 --no-mmap --ctx-size 131072 -fa 1 -ctk f16 -ctv f16 -ts 7/15/16 --jinja --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from libggml-cuda.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
load_backend: loaded ROCm backend from libggml-hip.so
load_backend: loaded RPC backend from libggml-rpc.so
load_backend: loaded CPU backend from libggml-cpu-haswell.so
main: setting n_parallel = 4 and kv_unified = true
build: 1 (6b9a524) with cc (Debian 15.2.0-7) 15.2.0 for x86_64-linux-gnu

Problem description & steps to reproduce

I know it is normal for generation t/s to become slower and slower as the context grows within a single chat session.

After building the newer version, I noticed a weird change:
New chat sessions now start slow and get slower as the conversation continues.

What used to happen (before):
New chats always started fast (max speed), and only slowed down later as the conversation got longer.

The problem:
This new slowdown at the very beginning of a fresh chat didn’t happen in the old version.

I started tinkering around and found out that setting --cache-reuse to 0 fixes the issue.

Broken builds are slow with --cache-reuse 256 but work fine with --cache-reuse 0.
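
A rough way to compare the two cases is to fire a few independent "fresh chat" requests at the server and watch the timings it reports (sketch only: the endpoint and payload follow llama-server's OpenAI-compatible API, and the prompt, max_tokens, and request count are just placeholders):

    # server assumed to be running with the command above on port 8383
    for i in 1 2 3; do
      curl -s http://localhost:8383/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"messages": [{"role": "user", "content": "Hello, what can you do?"}], "max_tokens": 128}' \
        > /dev/null
    done
    # Each iteration is a fresh chat (no prior messages). Per the behaviour described
    # above, with --cache-reuse 256 these fresh chats start slow, while restarting the
    # server with --cache-reuse 0 keeps them at full speed. Compare the prompt and
    # generation timings printed in the server log.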

I bisected it and the result was commit cd5e3b5 (b6927); b6923, the release just before b6927, still works fine.
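
For reference, the bisection can be reproduced with something like the following (sketch only: it assumes a llama.cpp checkout with the upstream release tags fetched, and each step means rebuilding llama-server and manually re-checking the fresh-chat speed with --cache-reuse 256):

    git bisect start
    git bisect bad b6927     # fresh chats start slow
    git bisect good b6923    # fresh chats start at full speed
    # at each step: rebuild, test, then mark it with `git bisect good` or `git bisect bad`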
