Labels
bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp; malfunctioning hinders an important workflow)
Description
What happened?
I consistently get a missing tensor blk.0.ffn_down_exps.weight error when trying to load mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf (from https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/tree/main ).
I verified that the file's checksum matches the one published on Hugging Face, so the model file is not corrupted.
I bisected the commits and found that the regression was introduced in c5b0f4b.
The evidence for the regression is in the terminal transcript in the log output below.
Name and Version
Latest llama.cpp at the time of reporting (commit c5b0f4b, tag b3990; see the transcript below).
What operating system are you seeing the problem on?
Mac
Relevant log output
% git log -n 1 ; ((make clean && make -j 10) | tail -n 2) && ./llama-cli -n 100 -m ~/Downloads/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -p 'haha' 2>&1 | tail -n 10
commit c5b0f4b5d90297f3e729fca7f78ddb25fcab5ddc (HEAD, tag: b3990)
Author: Diego Devesa <[email protected]>
Date: Wed Oct 30 02:01:23 2024 +0100
llama : refactor model loader with backend registry (#10026)
c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_AMX -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY ggml/src/ggml-blas.o ggml/src/llamafile/sgemm.o ggml/src/ggml-amx.o ggml/src/ggml-amx/mmq.o ggml/src/ggml-metal.o ggml/src/ggml-metal-embed.o ggml/src/ggml.o ggml/src/ggml-alloc.o ggml/src/ggml-backend.o ggml/src/ggml-quants.o ggml/src/ggml-aarch64.o src/llama.o src/llama-vocab.o src/llama-grammar.o src/llama-sampling.o src/unicode.o src/unicode-data.o common/common.o common/arg.o common/log.o common/console.o common/ngram-cache.o common/sampling.o common/train.o common/build-info.o common/json-schema-to-grammar.o examples/cvector-generator/cvector-generator.o -o llama-cvector-generator -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_AMX -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY ggml/src/ggml-blas.o ggml/src/llamafile/sgemm.o ggml/src/ggml-amx.o ggml/src/ggml-amx/mmq.o ggml/src/ggml-metal.o ggml/src/ggml-metal-embed.o ggml/src/ggml.o ggml/src/ggml-alloc.o ggml/src/ggml-backend.o ggml/src/ggml-quants.o ggml/src/ggml-aarch64.o src/llama.o src/llama-vocab.o src/llama-grammar.o src/llama-sampling.o src/unicode.o src/unicode-data.o common/common.o common/arg.o common/log.o common/console.o common/ngram-cache.o common/sampling.o common/train.o common/build-info.o common/json-schema-to-grammar.o -Iexamples/server examples/server/server.o -o llama-server -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
( make clean && make -j 10; ) 106.34s user 5.96s system 440% cpu 25.473 total
0.00s user 0.00s system 0% cpu 25.473 total
( ( make clean && make -j 10; ) | tail -n 2; ) 106.34s user 5.96s system 440% cpu 25.474 total
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llama_model_load: error loading model: missing tensor blk.0.ffn_down_exps.weight
llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model '/Users/redacted/Downloads/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf'
main: error: unable to load model
% git reset --hard HEAD~
HEAD is now at 8f275a7c ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the MobileVLM model. (#9763)
% git log -n 1 ; ((make clean && make -j 10) | tail -n 2) && ./llama-cli -n 100 -m ~/Downloads/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -p 'haha' 2>&1 | tail -n 10
commit 8f275a7c4593aa34147595a90282cf950a853690 (HEAD, tag: b3989)
Author: Changyeon Kim <[email protected]>
Date: Tue Oct 29 17:52:56 2024 +0900
ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the MobileVLM model. (#9763)
* ggml: Add POOL2D OP for GPU ACC to the Vulkan.
- The MobileVLM model now supports inference acceleration through GPU by utilizing the Vulkan backend.
- A GGML_OP_POOL_2D shader has been added. (Pooling)
- The encoding performance of the CLIP model improved from 2.8s on the CPU to 0.7s on the GPU.
Signed-off-by: Changyeon Kim <[email protected]>
* [fix] Correct the incorrect order of the parameters.
fix casting to int.
Signed-off-by: Changyeon Kim <[email protected]>
---------
Signed-off-by: Changyeon Kim <[email protected]>
c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_AMX -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY ggml/src/ggml-blas.o ggml/src/llamafile/sgemm.o ggml/src/ggml-amx.o ggml/src/ggml-amx/mmq.o ggml/src/ggml-metal.o ggml/src/ggml-metal-embed.o ggml/src/ggml.o ggml/src/ggml-alloc.o ggml/src/ggml-backend.o ggml/src/ggml-quants.o ggml/src/ggml-aarch64.o src/llama.o src/llama-vocab.o src/llama-grammar.o src/llama-sampling.o src/unicode.o src/unicode-data.o common/common.o common/arg.o common/log.o common/console.o common/ngram-cache.o common/sampling.o common/train.o common/build-info.o common/json-schema-to-grammar.o examples/speculative/speculative.o -o llama-speculative -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_AMX -DGGML_USE_METAL -DGGML_METAL_EMBED_LIBRARY ggml/src/ggml-blas.o ggml/src/llamafile/sgemm.o ggml/src/ggml-amx.o ggml/src/ggml-amx/mmq.o ggml/src/ggml-metal.o ggml/src/ggml-metal-embed.o ggml/src/ggml.o ggml/src/ggml-alloc.o ggml/src/ggml-backend.o ggml/src/ggml-quants.o ggml/src/ggml-aarch64.o src/llama.o src/llama-vocab.o src/llama-grammar.o src/llama-sampling.o src/unicode.o src/unicode-data.o common/common.o common/arg.o common/log.o common/console.o common/ngram-cache.o common/sampling.o common/train.o common/build-info.o common/json-schema-to-grammar.o -Iexamples/server examples/server/server.o -o llama-server -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
( make clean && make -j 10; ) 107.50s user 6.14s system 436% cpu 26.024 total
0.00s user 0.00s system 0% cpu 26.023 total
( ( make clean && make -j 10; ) | tail -n 2; ) 107.50s user 6.14s system 436% cpu 26.025 total
as far as the 'concept' of the god of the bible, i think most people are familiar enough with it, and the common interpretations of it, to discuss it here.
i dont know why this board cant just have normal discussions without people trying to
llama_perf_sampler_print: sampling time = 2.12 ms / 103 runs ( 0.02 ms per token, 48584.91 tokens per second)
llama_perf_context_print: load time = 6704.64 ms
llama_perf_context_print: prompt eval time = 113.42 ms / 3 tokens ( 37.80 ms per token, 26.45 tokens per second)
llama_perf_context_print: eval time = 3721.91 ms / 99 runs ( 37.60 ms per token, 26.60 tokens per second)
llama_perf_context_print: total time = 3843.64 ms / 102 tokens
ggml_metal_free: deallocating