Description
Name and Version
version: 5606 (247e5c6)
built with clang version 19.1.7 for x86_64-pc-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen 9950X + RTX 4090
Models
vonjack/bge-m3-gguf/bge-m3-q8_0-bert_cpp.gguf
Problem description & steps to reproduce
I'm using llama-server via llama-swap, with the following options:
llama-server --flash-attn -ngl 999 -ngld 999 --model ... --ctx-size 4096 --embeddings --pooling cls
My application sends a POST request to the /v1/embeddings API with the following contents:
{
"input": [
"string 1",
"string 2",
...
],
"model": "embed-bge-m3"
}
There are at most 64 strings in any single batch.
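For reference, here is a minimal Python sketch of how my application calls the endpoint (the base URL is the llama-swap port from the log below, the strings and model name are placeholders from my config, and the requests package is assumed):
import requests

# Hypothetical client sketch: llama-swap listens on :8080 and routes
# "embed-bge-m3" to the llama-server instance shown in the log.
URL = "http://127.0.0.1:8080/v1/embeddings"

payload = {
    "input": ["string 1", "string 2"],  # up to 64 strings per request
    "model": "embed-bge-m3",
}

resp = requests.post(URL, json=payload, timeout=30)
resp.raise_for_status()
vectors = [item["embedding"] for item in resp.json()["data"]]
print(len(vectors), "embeddings of length", len(vectors[0]))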
My application starts by sending a single empty string to determine the embedding length (sketched after the log below):
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2, n_tokens = 2, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 2, n_tokens = 2
slot release: id 0 | task 0 | stop processing: n_past = 2, truncated = 0
srv update_slots: all slots are idle
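The probe request itself, as a hedged sketch (same hypothetical URL and model name as above; per the model metadata the embedding length for bge-m3 is 1024):
import requests

URL = "http://127.0.0.1:8080/v1/embeddings"  # hypothetical llama-swap endpoint

# Probe with a single empty string to learn the embedding dimension.
probe = requests.post(URL, json={"input": [""], "model": "embed-bge-m3"}, timeout=30)
probe.raise_for_status()
n_embd = len(probe.json()["data"][0]["embedding"])  # expected 1024 for bge-m3
print("embedding length:", n_embd)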
Then it starts sending the real data:
srv log_server_r: request: POST /v1/embeddings 127.0.0.1 200
slot launch_slot_: id 0 | task 2 | processing task
slot update_slots: id 0 | task 2 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 7
slot update_slots: id 0 | task 2 | kv cache rm [0, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 7, n_tokens = 7, progress = 1.000000
slot update_slots: id 0 | task 2 | prompt done, n_past = 7, n_tokens = 7
slot release: id 0 | task 2 | stop processing: n_past = 7, truncated = 0
slot launch_slot_: id 0 | task 3 | processing task
slot update_slots: id 0 | task 3 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 8
slot update_slots: id 0 | task 3 | kv cache rm [0, end)
slot update_slots: id 0 | task 3 | prompt processing progress, n_past = 8, n_tokens = 8, progress = 1.000000
slot update_slots: id 0 | task 3 | prompt done, n_past = 8, n_tokens = 8
slot release: id 0 | task 3 | stop processing: n_past = 8, truncated = 0
slot launch_slot_: id 0 | task 4 | processing task
slot update_slots: id 0 | task 4 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 11
slot update_slots: id 0 | task 4 | kv cache rm [0, end)
slot update_slots: id 0 | task 4 | prompt processing progress, n_past = 11, n_tokens = 11, progress = 1.000000
slot update_slots: id 0 | task 4 | prompt done, n_past = 11, n_tokens = 11
slot release: id 0 | task 4 | stop processing: n_past = 11, truncated = 0
<this continues with incrementing task numbers>
slot launch_slot_: id 0 | task 39 | processing task
slot update_slots: id 0 | task 39 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 106
slot update_slots: id 0 | task 39 | kv cache rm [0, end)
slot update_slots: id 0 | task 39 | prompt processing progress, n_past = 106, n_tokens = 106, progress = 1.000000
slot update_slots: id 0 | task 39 | prompt done, n_past = 106, n_tokens = 106
slot release: id 0 | task 39 | stop processing: n_past = 106, truncated = 0
slot launch_slot_: id 0 | task 40 | processing task
slot update_slots: id 0 | task 40 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 39
slot update_slots: id 0 | task 40 | kv cache rm [0, end)
slot update_slots: id 0 | task 40 | prompt processing progress, n_past = 39, n_tokens = 39, progress = 1.000000
slot update_slots: id 0 | task 40 | prompt done, n_past = 39, n_tokens = 39
slot release: id 0 | task 40 | stop processing: n_past = 39, truncated = 0
srv update_slots: all slots are idle
At this point llama-server freezes while processing the input data and never sends a reply to my application. When the socket is closed, it can be seen that there were 64 tasks in the batch:
srv cancel_tasks: cancel task, id_task = 65
srv cancel_tasks: cancel task, id_task = 64
srv cancel_tasks: cancel task, id_task = 63
srv cancel_tasks: cancel task, id_task = 62
<tasks keep decreasing>
srv cancel_tasks: cancel task, id_task = 5
srv cancel_tasks: cancel task, id_task = 4
srv cancel_tasks: cancel task, id_task = 3
srv cancel_tasks: cancel task, id_task = 2
srv update_slots: all slots are idle
When another embeddings request is issued, the task identifiers seem to be wrong:
slot launch_slot_: id 0 | task 169 | processing task
slot update_slots: id 0 | task 169 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 2
slot update_slots: id 0 | task 169 | kv cache rm [0, end)
slot update_slots: id 0 | task 169 | prompt processing progress, n_past = 2, n_tokens = 2, progress = 1.000000
slot update_slots: id 0 | task 169 | prompt done, n_past = 2, n_tokens = 2
slot release: id 0 | task 169 | stop processing: n_past = 2, truncated = 0
slot launch_slot_: id 0 | task 42 | processing task
slot update_slots: id 0 | task 42 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 75
slot update_slots: id 0 | task 42 | kv cache rm [0, end)
slot update_slots: id 0 | task 42 | prompt processing progress, n_past = 75, n_tokens = 75, progress = 1.000000
slot update_slots: id 0 | task 42 | prompt done, n_past = 75, n_tokens = 75
srv log_server_r: request: POST /v1/embeddings 127.0.0.1 200
slot release: id 0 | task 42 | stop processing: n_past = 75, truncated = 0
slot launch_slot_: id 0 | task 172 | processing task
slot update_slots: id 0 | task 172 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 7
slot update_slots: id 0 | task 172 | kv cache rm [0, end)
slot update_slots: id 0 | task 172 | prompt processing progress, n_past = 7, n_tokens = 7, progress = 1.000000
slot update_slots: id 0 | task 172 | prompt done, n_past = 7, n_tokens = 7
slot release: id 0 | task 172 | stop processing: n_past = 7, truncated = 0
slot launch_slot_: id 0 | task 44 | processing task
slot update_slots: id 0 | task 44 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 54
slot update_slots: id 0 | task 44 | kv cache rm [0, end)
slot update_slots: id 0 | task 44 | prompt processing progress, n_past = 54, n_tokens = 54, progress = 1.000000
slot update_slots: id 0 | task 44 | prompt done, n_past = 54, n_tokens = 54
slot release: id 0 | task 44 | stop processing: n_past = 54, truncated = 0
slot launch_slot_: id 0 | task 45 | processing task
slot update_slots: id 0 | task 45 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 52
slot update_slots: id 0 | task 45 | kv cache rm [0, end)
slot update_slots: id 0 | task 45 | prompt processing progress, n_past = 52, n_tokens = 52, progress = 1.000000
slot update_slots: id 0 | task 45 | prompt done, n_past = 52, n_tokens = 52
slot release: id 0 | task 45 | stop processing: n_past = 52, truncated = 0
slot launch_slot_: id 0 | task 46 | processing task
slot update_slots: id 0 | task 46 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 64
slot update_slots: id 0 | task 46 | kv cache rm [0, end)
slot update_slots: id 0 | task 46 | prompt processing progress, n_past = 64, n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 46 | prompt done, n_past = 64, n_tokens = 64
slot release: id 0 | task 46 | stop processing: n_past = 64, truncated = 0
<tasks keep increasing>
slot launch_slot_: id 0 | task 61 | processing task
slot update_slots: id 0 | task 61 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 68
slot update_slots: id 0 | task 61 | kv cache rm [0, end)
slot update_slots: id 0 | task 61 | prompt processing progress, n_past = 68, n_tokens = 68, progress = 1.000000
slot update_slots: id 0 | task 61 | prompt done, n_past = 68, n_tokens = 68
slot release: id 0 | task 61 | stop processing: n_past = 68, truncated = 0
slot launch_slot_: id 0 | task 62 | processing task
slot update_slots: id 0 | task 62 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 63
slot update_slots: id 0 | task 62 | kv cache rm [0, end)
slot update_slots: id 0 | task 62 | prompt processing progress, n_past = 63, n_tokens = 63, progress = 1.000000
slot update_slots: id 0 | task 62 | prompt done, n_past = 63, n_tokens = 63
slot release: id 0 | task 62 | stop processing: n_past = 63, truncated = 0
slot launch_slot_: id 0 | task 63 | processing task
slot update_slots: id 0 | task 63 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 56
slot update_slots: id 0 | task 63 | kv cache rm [0, end)
slot update_slots: id 0 | task 63 | prompt processing progress, n_past = 56, n_tokens = 56, progress = 1.000000
slot update_slots: id 0 | task 63 | prompt done, n_past = 56, n_tokens = 56
slot release: id 0 | task 63 | stop processing: n_past = 56, truncated = 0
srv update_slots: all slots are idle
Processing eventually freezes again, and these are the task IDs reported when the socket is closed:
srv cancel_tasks: cancel task, id_task = 235
srv cancel_tasks: cancel task, id_task = 234
srv cancel_tasks: cancel task, id_task = 233
srv cancel_tasks: cancel task, id_task = 232
srv cancel_tasks: cancel task, id_task = 231
<tasks keep decreasing>
srv cancel_tasks: cancel task, id_task = 176
srv cancel_tasks: cancel task, id_task = 175
srv cancel_tasks: cancel task, id_task = 174
srv cancel_tasks: cancel task, id_task = 173
srv cancel_tasks: cancel task, id_task = 172
The batch size doesn't matter: even with a single string in each request, processing eventually freezes.
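A rough reproduction sketch (hedged: same hypothetical URL and model name as above, arbitrary timeout) that loops single-string requests until the server stops responding:
import requests

URL = "http://127.0.0.1:8080/v1/embeddings"

for i in range(10000):
    try:
        # One short string per request; the hang still occurs eventually.
        requests.post(
            URL,
            json={"input": [f"test string {i}"], "model": "embed-bge-m3"},
            timeout=10,
        ).raise_for_status()
    except requests.exceptions.Timeout:
        print(f"request {i} timed out - server appears frozen")
        break
Closing the connection after the timeout is what produces the cancel_tasks lines shown above.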
First Bad Commit
No response
Relevant log output
llama-swap listening on :8080
[INFO] Request ::1 "GET /v1/models HTTP/1.1" 200 734 "" 46.8µs
[INFO] Request ::1 "GET /running HTTP/1.1" 200 14 "" 20.52µs
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 5606 (247e5c6e) with clang version 19.1.7 for x86_64-pc-linux-gnu
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 5803, http threads: 31
main: loading model
srv load_model: loading model '/home/wolf/.lmstudio/models/vonjack/bge-m3-gguf/bge-m3-q8_0-bert_cpp.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 18491 MiB free
llama_model_loader: loaded meta data with 21 key-value pairs and 389 tensors from /home/wolf/.lmstudio/models/vonjack/bge-m3-gguf/bge-m3-q8_0-bert_cpp.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.name str = BERT
llama_model_loader: - kv 2: general.description str = GGML BERT model
llama_model_loader: - kv 3: general.file_type u32 = 7
llama_model_loader: - kv 4: bert.vocab_size u32 = 250002
llama_model_loader: - kv 5: bert.context_length u32 = 8194
llama_model_loader: - kv 6: bert.embedding_length u32 = 1024
llama_model_loader: - kv 7: bert.feed_forward_length u32 = 4096
llama_model_loader: - kv 8: bert.attention.head_count u32 = 16
llama_model_loader: - kv 9: bert.block_count u32 = 24
llama_model_loader: - kv 10: bert.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 11: bert.attention.causal bool = false
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 1
llama_model_loader: - kv 13: tokenizer.ggml.model str = bert
llama_model_loader: - kv 14: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 15: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.word_prefix str = ▁
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,250002] = ["<s>", "<pad>", "</s>", "<unk>", ","...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 243 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q8_0: 145 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 582.64 MiB (8.62 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 0
load: token to piece cache size = 2.1668 MB
print_info: arch = bert
print_info: vocab_only = 0
print_info: n_ctx_train = 8194
print_info: n_embd = 1024
print_info: n_layer = 24
print_info: n_head = 16
print_info: n_head_kv = 16
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 1.0e-05
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 4096
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8194
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 335M
print_info: model params = 566.71 M
print_info: general.name = BERT
print_info: vocab type = WPM
print_info: n_vocab = 250002
print_info: n_merges = 0
print_info: BOS token = 0 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 3 '<unk>'
print_info: SEP token = 102 'ta'
print_info: PAD token = 1 '<pad>'
print_info: MASK token = 103 'е'
print_info: LF token = 0 '<s>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors: CUDA0 model buffer size = 307.22 MiB
load_tensors: CPU_Mapped model buffer size = 275.42 MiB
.......................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 0
llama_context: flash_attn = 1
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (8194) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.00 MiB
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv log_server_r: request: GET /health 127.0.0.1 503
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: {%- for message in messages -%}
{{- '<|im_start|>' + message.role + '
' + message.content + '<|im_end|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|im_start|>assistant
' -}}
{%- endif -%}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:5803 - starting the main loop