Eval bug: Static build failing to initialize Metal backend on macOS #11669

@johnbean393

Description

Name and Version

myusername@Mac bin % ./llama-server --version
version: 4641 (9f4cc8f)
built with Apple clang version 16.0.0 (clang-1600.0.26.4) for arm64-apple-darwin24.3.0

Operating systems

Mac

GGML backends

Metal

Hardware

Apple M2 Max

Models

All models

Problem description & steps to reproduce

I needed a static build of llama-server, so I cloned the repository and built it locally with CMake using this command: cmake -B build -DBUILD_SHARED_LIBS=OFF; cmake --build build --config Release -j 12 -t "llama-server".
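
For comparison, a minimal way to check against the default shared build of the same checkout; if that one starts cleanly, the failure is specific to BUILD_SHARED_LIBS=OFF (build directory names here are illustrative):

# failing case: static build
cmake -B build-static -DBUILD_SHARED_LIBS=OFF
cmake --build build-static --config Release -j 12 -t llama-server

# control case: default shared build of the same commit
cmake -B build-shared
cmake --build build-shared --config Release -j 12 -t llama-server

# run both to check whether only the static binary fails
./build-static/bin/llama-server --version
./build-shared/bin/llama-server --version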

I then tested the binary with a number of models, but the server never started successfully, each time citing a failure to initialize the Metal backend. The full output is in the relevant log output section below.
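
Note that the system_info line in the log shows Metal : EMBED_LIBRARY = 1 and ggml_metal_init reports "using embedded metal library", so it is the runtime compilation of the embedded shader source that fails. A hedged workaround sketch, assuming GGML_METAL_EMBED_LIBRARY is the CMake switch behind that flag and GGML_METAL_PATH_RESOURCES the runtime override for locating the shader sources (neither verified against this exact commit):

# build statically but without embedding the Metal library
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_METAL_EMBED_LIBRARY=OFF
cmake --build build --config Release -j 12 -t llama-server

# point the runtime at ggml-metal.metal (the path is an assumption;
# adjust to wherever the file lives in the checkout)
GGML_METAL_PATH_RESOURCES=./ggml/src/ggml-metal \
    ./build/bin/llama-server --model <model.gguf>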

First Bad Commit

No response

Relevant log output

(base) bj@Mac bin % ./llama-server --model /Users/bj/Library/Application\ Support/Magic\ Sorter/Sorted\ Land/Computer\ DN/AI/Text\ Generation/Models/LLaMa\ 3.2/gguf/Meta-Llama-3.2-1B-Instruct-Q8_0.gguf
build: 4641 (9f4cc8f8) with Apple clang version 16.0.0 (clang-1600.0.26.4) for arm64-apple-darwin24.3.0
system info: n_threads = 8, n_threads_batch = 8, total_threads = 12

system_info: n_threads = 8 (n_threads_batch = 8) / 12 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | MATMUL_INT8 = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 11
main: loading model
srv    load_model: loading model '/Users/bj/Library/Application Support/Magic Sorter/Sorted Land/Computer DN/AI/Text Generation/Models/LLaMa 3.2/gguf/Meta-Llama-3.2-1B-Instruct-Q8_0.gguf'
llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 24575 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 147 tensors from /Users/bj/Library/Application Support/Magic Sorter/Sorted Land/Computer DN/AI/Text Generation/Models/LLaMa 3.2/gguf/Meta-Llama-3.2-1B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 16
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  18:                          general.file_type u32              = 7
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   34 tensors
llama_model_loader: - type q8_0:  113 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 1.22 GiB (8.50 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2048
print_info: n_layer          = 16
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 1.24 B
print_info: general.name     = Llama 3.2 1B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: offloading 16 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: Metal_Mapped model buffer size =  1252.43 MiB
load_tensors:   CPU_Mapped model buffer size =   266.16 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:61:35: error: unknown type name 'block_q4_0'
void dequantize_q4_0(device const block_q4_0 * xb, short il, thread type4x4 & reg) {
                                  ^
program_source:80:38: error: unknown type name 'block_q4_0'
void dequantize_q4_0_t4(device const block_q4_0 * xb, short il, thread type4 & reg) {
                                     ^
program_source:95:35: error: unknown type name 'block_q4_1'
void dequantize_q4_1(device const block_q4_1 * xb, short il, thread type4x4 & reg) {
                                  ^
program_source:114:38: error: unknown type name 'block_q4_1'
void dequantize_q4_1_t4(device const block_q4_1 * xb, short il, thread type4 & reg) {
                                     ^
program_source:129:35: error: unknown type name 'block_q5_0'
void dequantize_q5_0(device const block_q5_0 * xb, short il, thread type4x4 & reg) {
                                  ^
program_source:161:38: error: unknown type name 'block_q5_0'
void dequantize_q5_0_t4(device const block_q5_0 * xb, short il, thread type4 & reg) {
                                     ^
program_source:191:35: error: unknown type name 'block_q5_1'
void dequantize_q5_1(device const block_q5_1 * xb, short il, thread type4x4 & reg) {
                                  ^
program_source:223:38: error: unknown type name 'block_q5_1'
void dequantize_q5_1_t4(device const block_q5_1 * xb, short il, thread type4 & reg) {
                                     ^
program_source:253:35: error: unknown type name 'block_q8_0'
void dequantize_q8_0(device const block_q8_0 *xb, short il, thread type4x4 & reg) {
                                  ^
program_source:267:38: error: unknown type name 'block_q8_0'
void dequantize_q8_0_t4(device const block_q8_0 *xb, short il, thread type4 & reg) {
                                     ^
program_source:277:35: error: unknown type name 'block_q2_K'
void dequantize_q2_K(device const block_q2_K *xb, short il, thread type4x4 & reg) {
                                  ^
program_source:296:35: error: unknown type name 'block_q3_K'
void dequantize_q3_K(device const block_q3_K *xb, short il, thread type4x4 & reg) {
                                  ^
program_source:330:35: error: unknown type name 'block_q4_K'
void dequantize_q4_K(device const block_q4_K * xb, short il, thread type4x4 & reg) {
                                  ^
program_source:349:35: error: unknown type name 'block_q5_K'
void dequantize_q5_K(device const block_q5_K *xb, short il, thread type4x4 & reg) {
                                  ^
program_source:372:35: error: unknown type name 'block_q6_K'
void dequantize_q6_K(device const block_q6_K *xb, short il, thread type4x4 & reg) {
                                  ^
program_source:396:38: error: unknown type name 'block_iq2_xxs'
void dequantize_iq2_xxs(device const block_iq2_xxs * xb, short il, thread type4x4 & reg) {
                                     ^
program_source:408:52: error: use of undeclared identifier 'iq2xxs_grid'
    constant uint8_t * grid = (constant uint8_t *)(iq2xxs_grid + aux8[2*il+0]);
                                                   ^
program_source:409:21: error: use of undeclared identifier 'ksigns_iq2xs'
    uint8_t signs = ksigns_iq2xs[(aux32_s >> 14*il) & 127];
                    ^
program_source:411:49: error: use of undeclared identifier 'kmask_iq2xs'
        reg[i/4][i%4] = dl * grid[i] * (signs & kmask_iq2xs[i] ? -1.f : 1.f);
                                                ^
program_source:413:33: error: use of undeclared identifier 'iq2xxs_grid'
    grid = (constant uint8_t *)(iq2xxs_grid + aux8[2*il+1]);
                                ^
program_source:414:13: error: use of undeclared identifier 'ksigns_iq2xs'
    signs = ksigns_iq2xs[(aux32_s >> (14*il+7)) & 127];
            ^
program_source:416:51: error: use of undeclared identifier 'kmask_iq2xs'
        reg[2+i/4][i%4] = dl * grid[i] * (signs & kmask_iq2xs[i] ? -1.f : 1.f);
                                                  ^
program_source:421:37: error: unknown type name 'block_iq2_xs'
void dequantize_iq2_xs(device const block_iq2_xs * xb, short il, thread type4x4 & reg) {
                                    ^
program_source:429:52: error: use of undeclared identifier 'iq2xs_grid'
    constant uint8_t * grid = (constant uint8_t *)(iq2xs_grid + (q2[2*il+0] & 511));
                                                   ^
program_source:430:21: error: use of undeclared identifier 'ksigns_iq2xs'
    uint8_t signs = ksigns_iq2xs[q2[2*il+0] >> 9];
                    ^
.
.
.
More similar errors here (could not include them in the issue due to the character limit)
.
.
.
            ^
program_source:6634:82: error: explicit instantiation of 'kernel_mul_mv_id' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mv_id_iq4_xs_f32")]]  kernel kernel_mul_mv_id_t kernel_mul_mv_id<mmv_fn<kernel_mul_mv_iq4_xs_f32_impl>>;
                                                                                 ^
program_source:6547:13: note: explicit instantiation refers here
kernel void kernel_mul_mv_id(
            ^
}
ggml_backend_metal_device_init: error: failed to allocate context
llama_init_from_model: failed to initialize Metal backend
common_init_from_params: failed to create context with model '/Users/bj/Library/Application Support/Magic Sorter/Sorted Land/Computer DN/AI/Text Generation/Models/LLaMa 3.2/gguf/Meta-Llama-3.2-1B-Instruct-Q8_0.gguf'
srv    load_model: failed to load model, '/Users/bj/Library/Application Support/Magic Sorter/Sorted Land/Computer DN/AI/Text Generation/Models/LLaMa 3.2/gguf/Meta-Llama-3.2-1B-Instruct-Q8_0.gguf'
main: exiting due to model loading error
