Labels
bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp; malfunctioning hinders an important workflow)
Description
What happened?
I consistently get a missing tensor blk.0.ffn_down_exps.weight error when trying to load mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf (from https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/tree/main ).
I verified that the file's checksum matches the one published on Hugging Face, so the model file is not corrupted.
I bisected the commits and found that the regression was introduced in c5b0f4b.
The evidence for the regression is in the terminal transcript in the log output below.
Name and Version
Latest llama.cpp at the time of reporting (commit c5b0f4b, tag b3990; see the transcript below).
What operating system are you seeing the problem on?
Mac
Relevant log output
% git log -n 1 ; ((make clean && make -j 10) | tail -n 2) && ./llama-cli -n 100 -m ~/Downloads/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -p 'haha' 2>&1 | tail -n 10
commit c5b0f4b5d90297f3e729fca7f78ddb25fcab5ddc (HEAD, tag: b3990)
Author: Diego Devesa <[email protected]>
Date: Wed Oct 30 02:01:23 2024 +0100
llama : refactor model loader with backend registry (#10026)
c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_AMX -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY ggml/src/ggml-blas.o ggml/src/llamafile/sgemm.o ggml/src/ggml-amx.o ggml/src/ggml-amx/mmq.o ggml/src/ggml-metal.o ggml/src/ggml-metal-embed.o ggml/src/ggml.o ggml/src/ggml-alloc.o ggml/src/ggml-backend.o ggml/src/ggml-quants.o ggml/src/ggml-aarch64.o src/llama.o src/llama-vocab.o src/llama-grammar.o src/llama-sampling.o src/unicode.o src/unicode-data.o common/common.o common/arg.o common/log.o common/console.o common/ngram-cache.o common/sampling.o common/train.o common/build-info.o common/json-schema-to-grammar.o examples/cvector-generator/cvector-generator.o -o llama-cvector-generator -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_AMX -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY ggml/src/ggml-blas.o ggml/src/llamafile/sgemm.o ggml/src/ggml-amx.o ggml/src/ggml-amx/mmq.o ggml/src/ggml-metal.o ggml/src/ggml-metal-embed.o ggml/src/ggml.o ggml/src/ggml-alloc.o ggml/src/ggml-backend.o ggml/src/ggml-quants.o ggml/src/ggml-aarch64.o src/llama.o src/llama-vocab.o src/llama-grammar.o src/llama-sampling.o src/unicode.o src/unicode-data.o common/common.o common/arg.o common/log.o common/console.o common/ngram-cache.o common/sampling.o common/train.o common/build-info.o common/json-schema-to-grammar.o -Iexamples/server examples/server/server.o -o llama-server -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
( make clean && make -j 10; ) 106.34s user 5.96s system 440% cpu 25.473 total
0.00s user 0.00s system 0% cpu 25.473 total
( ( make clean && make -j 10; ) | tail -n 2; ) 106.34s user 5.96s system 440% cpu 25.474 total
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llama_model_load: error loading model: missing tensor blk.0.ffn_down_exps.weight
llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model '/Users/redacted/Downloads/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf'
main: error: unable to load model
% git reset --hard HEAD~
HEAD is now at 8f275a7c ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the MobileVLM model. (#9763)
% git log -n 1 ; ((make clean && make -j 10) | tail -n 2) && ./llama-cli -n 100 -m ~/Downloads/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -p 'haha' 2>&1 | tail -n 10
commit 8f275a7c4593aa34147595a90282cf950a853690 (HEAD, tag: b3989)
Author: Changyeon Kim <[email protected]>
Date: Tue Oct 29 17:52:56 2024 +0900
ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the MobileVLM model. (#9763)
* ggml: Add POOL2D OP for GPU ACC to the Vulkan.
- The MobileVLM model now supports inference acceleration through GPU by utilizing the Vulkan backend.
- A GGML_OP_POOL_2D shader has been added. (Pooling)
- The encoding performance of the CLIP model improved from 2.8s on the CPU to 0.7s on the GPU.
Signed-off-by: Changyeon Kim <[email protected]>
* [fix] Correct the incorrect order of the parameters.
fix casting to int.
Signed-off-by: Changyeon Kim <[email protected]>
---------
Signed-off-by: Changyeon Kim <[email protected]>
c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_AMX -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY ggml/src/ggml-blas.o ggml/src/llamafile/sgemm.o ggml/src/ggml-amx.o ggml/src/ggml-amx/mmq.o ggml/src/ggml-metal.o ggml/src/ggml-metal-embed.o ggml/src/ggml.o ggml/src/ggml-alloc.o ggml/src/ggml-backend.o ggml/src/ggml-quants.o ggml/src/ggml-aarch64.o src/llama.o src/llama-vocab.o src/llama-grammar.o src/llama-sampling.o src/unicode.o src/unicode-data.o common/common.o common/arg.o common/log.o common/console.o common/ngram-cache.o common/sampling.o common/train.o common/build-info.o common/json-schema-to-grammar.o examples/speculative/speculative.o -o llama-speculative -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_AMX -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY ggml/src/ggml-blas.o ggml/src/llamafile/sgemm.o ggml/src/ggml-amx.o ggml/src/ggml-amx/mmq.o ggml/src/ggml-metal.o ggml/src/ggml-metal-embed.o ggml/src/ggml.o ggml/src/ggml-alloc.o ggml/src/ggml-backend.o ggml/src/ggml-quants.o ggml/src/ggml-aarch64.o src/llama.o src/llama-vocab.o src/llama-grammar.o src/llama-sampling.o src/unicode.o src/unicode-data.o common/common.o common/arg.o common/log.o common/console.o common/ngram-cache.o common/sampling.o common/train.o common/build-info.o common/json-schema-to-grammar.o -Iexamples/server examples/server/server.o -o llama-server -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
( make clean && make -j 10; ) 107.50s user 6.14s system 436% cpu 26.024 total
0.00s user 0.00s system 0% cpu 26.023 total
( ( make clean && make -j 10; ) | tail -n 2; ) 107.50s user 6.14s system 436% cpu 26.025 total
as far as the 'concept' of the god of the bible, i think most people are familiar enough with it, and the common interpretations of it, to discuss it here.
i dont know why this board cant just have normal discussions without people trying to
llama_perf_sampler_print: sampling time = 2.12 ms / 103 runs ( 0.02 ms per token, 48584.91 tokens per second)
llama_perf_context_print: load time = 6704.64 ms
llama_perf_context_print: prompt eval time = 113.42 ms / 3 tokens ( 37.80 ms per token, 26.45 tokens per second)
llama_perf_context_print: eval time = 3721.91 ms / 99 runs ( 37.60 ms per token, 26.60 tokens per second)
llama_perf_context_print: total time = 3843.64 ms / 102 tokens
ggml_metal_free: deallocating