Eval bug: processing server tasks via REST API seems to freeze, with no response sent back to the calling application #14073

@wolfpld

Description

Name and Version

version: 5606 (247e5c6)
built with clang version 19.1.7 for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

Ryzen 9950X + RTX 4090

Models

vonjack/bge-m3-gguf/bge-m3-q8_0-bert_cpp.gguf

Problem description & steps to reproduce

I'm using llama-server via llama-swap, with the following options:

llama-server --flash-attn -ngl 999 -ngld 999 --model ... --ctx-size 4096 --embeddings --pooling cls

My application sends a POST request to the /v1/embeddings API with the following contents:

{
  "input": [
    "string 1",
    "string 2",
    ...
  ],
  "model": "embed-bge-m3"
}

There are at most 64 strings in any single batch.
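For reference, the same kind of request can be reproduced outside my application with plain curl (a minimal sketch, assuming the llama-swap port 8080 from the log below and the "embed-bge-m3" alias used in the request above):

curl -s http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["string 1", "string 2"], "model": "embed-bge-m3"}'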

My application starts by sending a single empty string to determine the length of the embedding vectors:

slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2, n_tokens = 2, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 2, n_tokens = 2
slot      release: id  0 | task 0 | stop processing: n_past = 2, truncated = 0
srv  update_slots: all slots are idle
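The same probe can be issued by hand (same assumed port and model alias as in the curl call above):

curl -s http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": [""], "model": "embed-bge-m3"}'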

Then it starts sending the real data:

srv  log_server_r: request: POST /v1/embeddings 127.0.0.1 200
slot launch_slot_: id  0 | task 2 | processing task
slot update_slots: id  0 | task 2 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 7
slot update_slots: id  0 | task 2 | kv cache rm [0, end)
slot update_slots: id  0 | task 2 | prompt processing progress, n_past = 7, n_tokens = 7, progress = 1.000000
slot update_slots: id  0 | task 2 | prompt done, n_past = 7, n_tokens = 7
slot      release: id  0 | task 2 | stop processing: n_past = 7, truncated = 0
slot launch_slot_: id  0 | task 3 | processing task
slot update_slots: id  0 | task 3 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 8
slot update_slots: id  0 | task 3 | kv cache rm [0, end)
slot update_slots: id  0 | task 3 | prompt processing progress, n_past = 8, n_tokens = 8, progress = 1.000000
slot update_slots: id  0 | task 3 | prompt done, n_past = 8, n_tokens = 8
slot      release: id  0 | task 3 | stop processing: n_past = 8, truncated = 0
slot launch_slot_: id  0 | task 4 | processing task
slot update_slots: id  0 | task 4 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 11
slot update_slots: id  0 | task 4 | kv cache rm [0, end)
slot update_slots: id  0 | task 4 | prompt processing progress, n_past = 11, n_tokens = 11, progress = 1.000000
slot update_slots: id  0 | task 4 | prompt done, n_past = 11, n_tokens = 11
slot      release: id  0 | task 4 | stop processing: n_past = 11, truncated = 0

<this continues with incrementing task numbers>

slot launch_slot_: id  0 | task 39 | processing task
slot update_slots: id  0 | task 39 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 106
slot update_slots: id  0 | task 39 | kv cache rm [0, end)
slot update_slots: id  0 | task 39 | prompt processing progress, n_past = 106, n_tokens = 106, progress = 1.000000
slot update_slots: id  0 | task 39 | prompt done, n_past = 106, n_tokens = 106
slot      release: id  0 | task 39 | stop processing: n_past = 106, truncated = 0
slot launch_slot_: id  0 | task 40 | processing task
slot update_slots: id  0 | task 40 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 39
slot update_slots: id  0 | task 40 | kv cache rm [0, end)
slot update_slots: id  0 | task 40 | prompt processing progress, n_past = 39, n_tokens = 39, progress = 1.000000
slot update_slots: id  0 | task 40 | prompt done, n_past = 39, n_tokens = 39
slot      release: id  0 | task 40 | stop processing: n_past = 39, truncated = 0
srv  update_slots: all slots are idle

At this point llama-server freezes while processing the input data and never sends a reply to my application. When the client socket is closed, the cancellation messages show that there were 64 tasks in the batch:

srv  cancel_tasks: cancel task, id_task = 65
srv  cancel_tasks: cancel task, id_task = 64
srv  cancel_tasks: cancel task, id_task = 63
srv  cancel_tasks: cancel task, id_task = 62

<tasks keep decreasing>

srv  cancel_tasks: cancel task, id_task = 5
srv  cancel_tasks: cancel task, id_task = 4
srv  cancel_tasks: cancel task, id_task = 3
srv  cancel_tasks: cancel task, id_task = 2
srv  update_slots: all slots are idle

When another embeddings request is issued, the task identifiers seem to be wrong:

slot launch_slot_: id  0 | task 169 | processing task
slot update_slots: id  0 | task 169 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id  0 | task 169 | kv cache rm [0, end)
slot update_slots: id  0 | task 169 | prompt processing progress, n_past = 2, n_tokens = 2, progress = 1.000000
slot update_slots: id  0 | task 169 | prompt done, n_past = 2, n_tokens = 2
slot      release: id  0 | task 169 | stop processing: n_past = 2, truncated = 0
slot launch_slot_: id  0 | task 42 | processing task
slot update_slots: id  0 | task 42 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 75
slot update_slots: id  0 | task 42 | kv cache rm [0, end)
slot update_slots: id  0 | task 42 | prompt processing progress, n_past = 75, n_tokens = 75, progress = 1.000000
slot update_slots: id  0 | task 42 | prompt done, n_past = 75, n_tokens = 75
srv  log_server_r: request: POST /v1/embeddings 127.0.0.1 200
slot      release: id  0 | task 42 | stop processing: n_past = 75, truncated = 0
slot launch_slot_: id  0 | task 172 | processing task
slot update_slots: id  0 | task 172 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 7
slot update_slots: id  0 | task 172 | kv cache rm [0, end)
slot update_slots: id  0 | task 172 | prompt processing progress, n_past = 7, n_tokens = 7, progress = 1.000000
slot update_slots: id  0 | task 172 | prompt done, n_past = 7, n_tokens = 7
slot      release: id  0 | task 172 | stop processing: n_past = 7, truncated = 0
slot launch_slot_: id  0 | task 44 | processing task
slot update_slots: id  0 | task 44 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 54
slot update_slots: id  0 | task 44 | kv cache rm [0, end)
slot update_slots: id  0 | task 44 | prompt processing progress, n_past = 54, n_tokens = 54, progress = 1.000000
slot update_slots: id  0 | task 44 | prompt done, n_past = 54, n_tokens = 54
slot      release: id  0 | task 44 | stop processing: n_past = 54, truncated = 0
slot launch_slot_: id  0 | task 45 | processing task
slot update_slots: id  0 | task 45 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 52
slot update_slots: id  0 | task 45 | kv cache rm [0, end)
slot update_slots: id  0 | task 45 | prompt processing progress, n_past = 52, n_tokens = 52, progress = 1.000000
slot update_slots: id  0 | task 45 | prompt done, n_past = 52, n_tokens = 52
slot      release: id  0 | task 45 | stop processing: n_past = 52, truncated = 0
slot launch_slot_: id  0 | task 46 | processing task
slot update_slots: id  0 | task 46 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 64
slot update_slots: id  0 | task 46 | kv cache rm [0, end)
slot update_slots: id  0 | task 46 | prompt processing progress, n_past = 64, n_tokens = 64, progress = 1.000000
slot update_slots: id  0 | task 46 | prompt done, n_past = 64, n_tokens = 64
slot      release: id  0 | task 46 | stop processing: n_past = 64, truncated = 0

<tasks keep increasing>

slot launch_slot_: id  0 | task 61 | processing task
slot update_slots: id  0 | task 61 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 68
slot update_slots: id  0 | task 61 | kv cache rm [0, end)
slot update_slots: id  0 | task 61 | prompt processing progress, n_past = 68, n_tokens = 68, progress = 1.000000
slot update_slots: id  0 | task 61 | prompt done, n_past = 68, n_tokens = 68
slot      release: id  0 | task 61 | stop processing: n_past = 68, truncated = 0
slot launch_slot_: id  0 | task 62 | processing task
slot update_slots: id  0 | task 62 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 63
slot update_slots: id  0 | task 62 | kv cache rm [0, end)
slot update_slots: id  0 | task 62 | prompt processing progress, n_past = 63, n_tokens = 63, progress = 1.000000
slot update_slots: id  0 | task 62 | prompt done, n_past = 63, n_tokens = 63
slot      release: id  0 | task 62 | stop processing: n_past = 63, truncated = 0
slot launch_slot_: id  0 | task 63 | processing task
slot update_slots: id  0 | task 63 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 56
slot update_slots: id  0 | task 63 | kv cache rm [0, end)
slot update_slots: id  0 | task 63 | prompt processing progress, n_past = 56, n_tokens = 56, progress = 1.000000
slot update_slots: id  0 | task 63 | prompt done, n_past = 56, n_tokens = 56
slot      release: id  0 | task 63 | stop processing: n_past = 56, truncated = 0
srv  update_slots: all slots are idle

The processing eventually freezes again; these are the task IDs reported when the socket is closed:

srv  cancel_tasks: cancel task, id_task = 235
srv  cancel_tasks: cancel task, id_task = 234
srv  cancel_tasks: cancel task, id_task = 233
srv  cancel_tasks: cancel task, id_task = 232
srv  cancel_tasks: cancel task, id_task = 231

<tasks keep decreasing>

srv  cancel_tasks: cancel task, id_task = 176
srv  cancel_tasks: cancel task, id_task = 175
srv  cancel_tasks: cancel task, id_task = 174
srv  cancel_tasks: cancel task, id_task = 173
srv  cancel_tasks: cancel task, id_task = 172

The batch size doesn't matter: even with a single string in each request, the processing eventually freezes.
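A minimal way to reproduce this without my application is a plain shell loop of single-string requests (a sketch, again assuming the llama-swap port and model alias from above; --max-time is only there so curl returns once the server stops responding):

for i in $(seq 1 1000); do
  curl -s --max-time 30 http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d "{\"input\": [\"test string $i\"], \"model\": \"embed-bge-m3\"}" \
    > /dev/null || { echo "request $i failed or timed out"; break; }
done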

First Bad Commit

No response

Relevant log output

llama-swap listening on :8080
[INFO] Request ::1 "GET /v1/models HTTP/1.1" 200 734 "" 46.8µs
[INFO] Request ::1 "GET /running HTTP/1.1" 200 14 "" 20.52µs
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 5606 (247e5c6e) with clang version 19.1.7 for x86_64-pc-linux-gnu
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 5803, http threads: 31
main: loading model
srv    load_model: loading model '/home/wolf/.lmstudio/models/vonjack/bge-m3-gguf/bge-m3-q8_0-bert_cpp.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 18491 MiB free
llama_model_loader: loaded meta data with 21 key-value pairs and 389 tensors from /home/wolf/.lmstudio/models/vonjack/bge-m3-gguf/bge-m3-q8_0-bert_cpp.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = BERT
llama_model_loader: - kv   2:                        general.description str              = GGML BERT model
llama_model_loader: - kv   3:                          general.file_type u32              = 7
llama_model_loader: - kv   4:                            bert.vocab_size u32              = 250002
llama_model_loader: - kv   5:                        bert.context_length u32              = 8194
llama_model_loader: - kv   6:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv   7:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv   8:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv   9:                           bert.block_count u32              = 24
llama_model_loader: - kv  10:          bert.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                      bert.attention.causal bool             = false
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 1
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  14:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  15:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:                 tokenizer.ggml.word_prefix str              = ▁
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,250002]  = ["<s>", "<pad>", "</s>", "<unk>", ","...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  243 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q8_0:  145 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 582.64 MiB (8.62 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 0
load: token to piece cache size = 2.1668 MB
print_info: arch             = bert
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8194
print_info: n_embd           = 1024
print_info: n_layer          = 24
print_info: n_head           = 16
print_info: n_head_kv        = 16
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 1.0e-05
print_info: f_norm_rms_eps   = 0.0e+00
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 4096
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 0
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8194
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 335M
print_info: model params     = 566.71 M
print_info: general.name     = BERT
print_info: vocab type       = WPM
print_info: n_vocab          = 250002
print_info: n_merges         = 0
print_info: BOS token        = 0 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 3 '<unk>'
print_info: SEP token        = 102 'ta'
print_info: PAD token        = 1 '<pad>'
print_info: MASK token       = 103 'е'
print_info: LF token         = 0 '<s>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:        CUDA0 model buffer size =   307.22 MiB
load_tensors:   CPU_Mapped model buffer size =   275.42 MiB
.......................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 0
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (8194) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.00 MiB
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv  log_server_r: request: GET /health 127.0.0.1 503
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: {%- for message in messages -%}
  {{- '<|im_start|>' + message.role + '
' + message.content + '<|im_end|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
  {{- '<|im_start|>assistant
' -}}
{%- endif -%}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:5803 - starting the main loop
