-
Notifications
You must be signed in to change notification settings - Fork 13.6k
Description
Name and Version
$./llama-cli --version
version: b6927 (6b9a524)
built with cc (Debian 15.2.0-7) 15.2.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server --port 8383 --host 0.0.0.0 -m qwen3_30B-A3B_Q6_K.gguf --cache-reuse 256 --no-mmap --ctx-size 131072 -fa 1 -ctk f16 -ctv f16 -ts 7/15/16 --jinja --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from libggml-cuda.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Instinct MI60 / MI50, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
load_backend: loaded ROCm backend from libggml-hip.so
load_backend: loaded RPC backend from libggml-rpc.so
load_backend: loaded CPU backend from libggml-cpu-haswell.so
main: setting n_parallel = 4 and kv_unified = true
build: 1 (6b9a524) with cc (Debian 15.2.0-7) 15.2.0 for x86_64-linux-gnu
Problem description & steps to reproduce
I know it is normal that as context grows in a single chat session, generation t/s becomes slower and slower.
After building the newer version, I noticed a weird change:
New chat sessions now start slow and get slower as the conversation continues.
What used to happen (before):
New chats always started fast (max speed), and only slowed down later as the conversation got longer.
The problem:
This new slowdown at the very beginning of a fresh chat didn’t happen in the old version.
I started tinkering around and found out that if set cache-reuse to 0 then it fixes the issue.
Broken builds are slow with --cache-reuse 256 but works find with --cache-reuse 0.
I bisected it and the result was commit cd5e3b5 (b6927), b6923 a version before b6927 seems to work.