
Eval bug: Qwerky QwQ 32B (rwkv6qwen2) failed to load #12662

@kanttouchthis

Description


Name and Version

version: 5002 (2c3f8b8)
built with MSVC 19.29.30158.0 for

Operating systems

Windows

GGML backends

CUDA

Hardware

Ryzen 5700X + RTX 3090

Models

featherless-ai/Qwerky-QwQ-32B
tested with DevQuasar/featherless-ai.Qwerky-QwQ-32B-GGUF Q4_K_S
and IQ4_XS created with convert_hf_to_gguf.py locally

Problem description & steps to reproduce

Launching the server/CLI fails with: llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.time_mix_w1.weight' has wrong shape; expected 5120, 320, got 5120, 640, 1, 1
I tried it with the Q4_K_S model from the hub and with a locally created IQ4_XS model. Conversion completes without any errors, but loading fails for both quants.
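
The numbers line up with the GGUF metadata in the log below (rwkv6qwen2.time_mix_extra_dim = 64): 320 = 5 × 64 and 640 = 5 × 128. A minimal sketch of that arithmetic, assuming the loader sizes time_mix_w1 as {n_embd, 5 * time_mix_extra_dim} (five LoRA projections in the RWKV6 time-mix block; my reading, not confirmed from the source):

# Assumption: llama.cpp checks time_mix_w1 against {n_embd, 5 * time_mix_extra_dim}.
n_embd = 5120                # rwkv6qwen2.embedding_length from the GGUF metadata
time_mix_extra_dim = 64      # rwkv6qwen2.time_mix_extra_dim from the GGUF metadata

expected = (n_embd, 5 * time_mix_extra_dim)  # (5120, 320) -- what the loader wants
on_disk  = (n_embd, 640)                     # what the converted file contains

# 640 // 5 == 128: the tensor was written for a LoRA rank of 128, not the 64
# recorded in the metadata, i.e. the converter and loader disagree.
print(expected, on_disk, on_disk[1] // 5)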

First Bad Commit

No response

Relevant log output

$ llama-server.exe -m model.gguf -ctk q8_0 -ctv q8_0 --n_gpu_layers 128 --ctx-size 32768 -fa
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 5002 (2c3f8b85) with MSVC 19.29.30158.0 for
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 500,610,700,750,800 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 15
main: loading model
srv    load_model: loading model 'model.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 1283 tensors from model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = rwkv6qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Featherless ai.Qwerky QwQ 32B
llama_model_loader: - kv   3:                           general.basename str              = featherless-ai.Qwerky-QwQ
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                  rwkv6qwen2.context_length u32              = 1048576
llama_model_loader: - kv   7:                rwkv6qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   8:                     rwkv6qwen2.block_count u32              = 64
llama_model_loader: - kv   9:                   rwkv6qwen2.wkv.head_size u32              = 128
llama_model_loader: - kv  10:              rwkv6qwen2.time_mix_extra_dim u32              = 64
llama_model_loader: - kv  11:            rwkv6qwen2.time_decay_extra_dim u32              = 128
llama_model_loader: - kv  12:             rwkv6qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv  13: rwkv6qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  14:               rwkv6qwen2.token_shift_count u32              = 1
llama_model_loader: - kv  15:         rwkv6qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  16:            rwkv6qwen2.attention.head_count u32              = 0
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                          general.file_type u32              = 14
llama_model_loader: - type  f32:  769 tensors
llama_model_loader: - type q4_K:  505 tensors
llama_model_loader: - type q5_K:    8 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Small
print_info: file size   = 20.25 GiB (4.98 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = rwkv6qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 1048576
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 0
print_info: n_head_kv        = 8
print_info: n_rot            = 0
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 0
print_info: n_embd_head_v    = 0
print_info: n_gqa            = 0
print_info: n_embd_k_gqa     = 0
print_info: n_embd_v_gqa     = 0
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 27648
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = -1
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 1048576
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 34.95 B
print_info: general.name     = Featherless ai.Qwerky QwQ 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.time_mix_w1.weight' has wrong shape; expected  5120,   320, got  5120,   640,     1,     1
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'model.gguf'
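
For anyone reproducing this, the gguf Python package bundled with llama.cpp (also on PyPI as gguf) can confirm what the converter actually wrote, independently of the loader. A minimal sketch, with model.gguf as a placeholder path:

# Dump the on-disk shape of the failing tensor straight from the file.
from gguf import GGUFReader

reader = GGUFReader("model.gguf")  # placeholder path
for tensor in reader.tensors:
    if tensor.name == "blk.0.time_mix_w1.weight":
        # Shape as stored in the file; compare against the loader's expected {5120, 320}.
        print(tensor.name, list(tensor.shape))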
