Eval bug: GGML_ASSERT(ggml_is_contiguous(a)) with Jina reranker model #15895

@deiteris

Description


Name and Version

.\llama-server.exe --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6600M (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
version: 6428 (a972fae)
built with MSVC 19.38.33134.0 for x64

Operating systems

Windows

GGML backends

Vulkan

Hardware

Ryzen 7 5800H + AMD Radeon RX 6600M

Models

jina-reranker-v2-base-multilingual

Problem description & steps to reproduce

When I try to serve the Jina reranker model, it fails with C:\Sources\llama.cpp\ggml\src\ggml.c:3435: GGML_ASSERT(ggml_is_contiguous(a)) failed. I tried enabling/disabling flash attention and changing the context/batch size, with no success either.
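
For context, ggml treats a tensor as contiguous when its byte strides (nb) describe a densely packed row-major layout with no gaps. Below is a minimal sketch of what the failing assertion checks, written against the public ggml.h tensor fields; the helper name is hypothetical and the actual in-tree ggml_is_contiguous may differ in detail:

#include <stdbool.h>
#include "ggml.h"

// Sketch only: a tensor is "contiguous" when its byte strides leave
// no gaps between elements, rows, or planes.
static bool is_contiguous_sketch(const struct ggml_tensor * t) {
    return
        // innermost stride equals the size of one element (or one quantized block)
        t->nb[0] == ggml_type_size(t->type) &&
        // each row of ne[0] elements is stored as ne[0]/blck_size packed blocks
        t->nb[1] == t->nb[0] * (t->ne[0] / ggml_blck_size(t->type)) &&
        // higher dimensions are packed with no padding
        t->nb[2] == t->nb[1] * t->ne[1] &&
        t->nb[3] == t->nb[2] * t->ne[2];
}

A permuted or transposed view (for example the result of ggml_permute without a following ggml_cont) fails this check, which is a common way a compute graph trips this assert during context construction.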

First Bad Commit

No response

Relevant log output

PS C:\Sources\llama.cpp\build\bin\Release> .\llama-server.exe --reranking -ub 1024 -b 1024 -c 1024 --host 192.168.100.21 --port 8081 -m C:\Temp\Jina-Reranker-v2-Base-Multilingual-278M-Q8_0.gguf -ngl 99 -fa 0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6600M (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
build: 6428 (a972faebe) with MSVC 19.38.33134.0 for x64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 192.168.100.21, port: 8081, http threads: 15
main: loading model
srv    load_model: loading model 'C:\Temp\Jina-Reranker-v2-Base-Multilingual-278M-Q8_0.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6600M) - 7360 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 153 tensors from C:\Temp\Jina-Reranker-v2-Base-Multilingual-278M-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Jina Reranker v2 Base Multilingual
llama_model_loader: - kv   3:                       general.organization str              = Jinaai
llama_model_loader: - kv   4:                         general.size_label str              = 278M
llama_model_loader: - kv   5:                            general.license str              = cc-by-nc-4.0
llama_model_loader: - kv   6:                               general.tags arr[str,5]       = ["transformers", "reranker", "cross-e...
llama_model_loader: - kv   7:                          general.languages arr[str,1]       = ["multilingual"]
llama_model_loader: - kv   8:                           bert.block_count u32              = 12
llama_model_loader: - kv   9:                        bert.context_length u32              = 1024
llama_model_loader: - kv  10:                      bert.embedding_length u32              = 768
llama_model_loader: - kv  11:                   bert.feed_forward_length u32              = 3072
llama_model_loader: - kv  12:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv  13:          bert.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  14:                          general.file_type u32              = 7
llama_model_loader: - kv  15:                      bert.attention.causal bool             = false
llama_model_loader: - kv  16:              bert.classifier.output_labels arr[str,1]       = ["LABEL_0"]
llama_model_loader: - kv  17:               general.quantization_version u32              = 2
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = t5
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,250002]  = ["<s>", "<pad>", "</s>", "<unk>", ","...
llama_model_loader: - kv  21:                      tokenizer.ggml.scores arr[f32,250002]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,250002]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  24:            tokenizer.ggml.token_type_count u32              = 1
llama_model_loader: - kv  25:    tokenizer.ggml.remove_extra_whitespaces bool             = true
llama_model_loader: - kv  26:        tokenizer.ggml.precompiled_charsmap arr[u8,237539]   = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  30:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  32:               tokenizer.ggml.mask_token_id u32              = 250001
llama_model_loader: - kv  33:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  34:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - type  f32:  102 tensors
llama_model_loader: - type q8_0:   51 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 284.68 MiB (8.58 BPW)
load: model vocab missing newline token, using special_pad_id instead
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 5
load: token to piece cache size = 2.1668 MB
print_info: arch             = bert
print_info: vocab_only       = 0
print_info: n_ctx_train      = 1024
print_info: n_embd           = 768
print_info: n_layer          = 12
print_info: n_head           = 12
print_info: n_head_kv        = 12
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 768
print_info: n_embd_v_gqa     = 768
print_info: f_norm_eps       = 1.0e-05
print_info: f_norm_rms_eps   = 0.0e+00
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 3072
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 0
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 1024
print_info: rope_finetuned   = unknown
print_info: n_cls_out        = 1
print_info: cls_label[ 0]    = LABEL_0
print_info: model type       = 109M
print_info: model params     = 278.44 M
print_info: general.name     = Jina Reranker v2 Base Multilingual
print_info: vocab type       = UGM
print_info: n_vocab          = 250002
print_info: n_merges         = 0
print_info: BOS token        = 0 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 3 '<unk>'
print_info: SEP token        = 2 '</s>'
print_info: PAD token        = 1 '<pad>'
print_info: MASK token       = 250001 '<mask>'
print_info: LF token         = 0 '<s>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors:      Vulkan0 model buffer size =    87.12 MiB
load_tensors:   CPU_Mapped model buffer size =   197.56 MiB
.................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 1024
llama_context: n_ctx_per_seq = 1024
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 1024
llama_context: causal_attn   = 0
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: Vulkan_Host  output buffer size =     0.96 MiB
C:\Sources\llama.cpp\ggml\src\ggml.c:3435: GGML_ASSERT(ggml_is_contiguous(a)) failed

Metadata

Labels: bug (Something isn't working)