
Misc. bug: regression: using cache-reuse slows down subsequent prompts (chat sessions) #17065


Description

@daitj

Name and Version

$./llama-cli --version
version: b6927 (6b9a524)
built with cc (Debian 15.2.0-7) 15.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server --port 8383 --host 0.0.0.0 -m qwen3_30B-A3B_Q6_K.gguf --cache-reuse 256 --no-mmap --ctx-size 131072 -fa 1 -ctk f16 -ctv f16 -ts 7/15/16 --jinja --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from libggml-cuda.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
load_backend: loaded ROCm backend from libggml-hip.so
load_backend: loaded RPC backend from libggml-rpc.so
load_backend: loaded CPU backend from libggml-cpu-haswell.so
main: setting n_parallel = 4 and kv_unified = true
build: 1 (6b9a524) with cc (Debian 15.2.0-7) 15.2.0 for x86_64-linux-gnu

Problem description & steps to reproduce

I know it is normal for generation t/s to become slower and slower as the context grows within a single chat session.

After building the newer version, I noticed a weird change:
New chat sessions now start slow and get slower as the conversation continues.

What used to happen (before):
New chats always started fast (max speed), and only slowed down later as the conversation got longer.

The problem:
This new slowdown at the very beginning of a fresh chat didn’t happen in the old version.

I started tinkering around and found out that setting --cache-reuse to 0 fixes the issue.

Broken builds are slow with --cache-reuse 256 but work fine with --cache-reuse 0.
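
A rough way to compare the two cases is to fire a few independent "fresh chat" requests at the server and watch the timings it reports (sketch only: the endpoint and payload follow llama-server's OpenAI-compatible API, and the prompt, max_tokens, and request count are just placeholders):

    # server assumed to be running with the command above on port 8383
    for i in 1 2 3; do
      curl -s http://localhost:8383/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"messages": [{"role": "user", "content": "Hello, what can you do?"}], "max_tokens": 128}' \
        > /dev/null
    done
    # Each iteration is a fresh chat (no prior messages). Per the behaviour described
    # above, with --cache-reuse 256 these fresh chats start slow, while restarting the
    # server with --cache-reuse 0 keeps them at full speed. Compare the prompt and
    # generation timings printed in the server log.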

I bisected it and the result was commit cd5e3b5 (b6927); b6923, the release just before b6927, still works fine.
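
For reference, the bisection can be reproduced with something like the following (sketch only: it assumes a llama.cpp checkout with the upstream release tags fetched, and each step means rebuilding llama-server and manually re-checking the fresh-chat speed with --cache-reuse 256):

    git bisect start
    git bisect bad b6927     # fresh chats start slow
    git bisect good b6923    # fresh chats start at full speed
    # at each step: rebuild, test, then mark it with `git bisect good` or `git bisect bad`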
