
Eval bug: ggml_gallocr_reserve_n tries to allocate beyond max buffer size #11332

@BenPortner

Description

Name and Version

version: 4517 (9f7add1)
built with MSVC 19.42.34436.0 for

Operating systems

Windows

GGML backends

Vulkan

Hardware

Dell Latitude 5420
Windows 10 Enterprise
CPU: 11th Gen Intel i7-1185G7 @ 3.00GHz, 4 Cores, 8 Logical Processors x86_64
RAM: 2x16GB Hynix 3200MHz DDR4 PC4-25600
GPU: Intel Iris Xe iGPU
Storage: Western Digital PC SN530 NVMe WDC 512GB M.2 SSD

Models

tensorblock/CodeQwen1.5-7B-GGUF

Problem description & steps to reproduce

I get an OOM error when trying to use CodeQwen 1.5 7B Q4_K_M with large context sizes (e.g. 65000). Both the model and KV buffers allocate just fine. However, there is another allocation during ggml_gallocr_reserve_n, and that one causes the error. More specifically, galloc->buf_tallocs[0]->max_size is calculated by ggml_gallocr_alloc_graph_impl and then used in line 766 of ggml-alloc.c to allocate a new buffer. Because galloc->buf_tallocs[0]->max_size is never checked against the device's maximum allowed buffer size, the allocation fails. In my case, galloc->buf_tallocs[0]->max_size evaluates to 4428140544, whereas ggml_backend_buft_get_max_size yields 4294901760.
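
For illustration, here is a minimal, self-contained sketch (mine, not code from ggml) of the comparison that appears to be missing. would_fit() is a hypothetical helper; in ggml the requested size would come from galloc->buf_tallocs[i]->max_size and the limit from ggml_backend_buft_get_max_size(). The two constants are simply the values reported above.

```c
/* Sketch only, not a patch: the missing check is essentially this comparison,
 * performed before the backend allocation is attempted. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

static bool would_fit(size_t requested, size_t max_buffer_size) {
    return requested <= max_buffer_size;
}

int main(void) {
    const size_t requested = 4428140544ULL; /* galloc->buf_tallocs[0]->max_size (reported) */
    const size_t max_size  = 4294901760ULL; /* ggml_backend_buft_get_max_size() (reported) */

    if (!would_fit(requested, max_size)) {
        /* this is the case hit here: the graph buffer is ~127 MiB over the
         * device limit, so Vulkan rejects it with ErrorOutOfDeviceMemory */
        printf("graph buffer of %zu bytes exceeds device max buffer size of %zu bytes\n",
               requested, max_size);
    }
    return 0;
}
```

Presumably the allocator would either need to fail early with a clearer message or split the compute buffer across multiple allocations, but I have not looked into what the right fix would be.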

It seems to me that this is a bug in llama.cpp because running Llama 3.2 3B Instruct Q4_0 with the same (and even larger) context sizes works without problems.
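
For context, the KV buffer size printed in the log below follows directly from the printed hyperparameters, which shows how quickly these buffers grow with the context size (the compute buffer that fails presumably grows with n_ctx as well, though its exact size depends on the graph). A quick back-of-the-envelope check, using only numbers from the log:

```c
/* My own sanity check, not part of the report: reproduce the
 * "KV self size = 4064.00 MiB" line from the printed hyperparameters. */
#include <stddef.h>
#include <stdio.h>

int main(void) {
    const size_t n_ctx        = 65024; /* llama_init_from_model: n_ctx */
    const size_t n_layer      = 32;    /* print_info: n_layer */
    const size_t n_embd_k_gqa = 512;   /* print_info: n_embd_k_gqa */
    const size_t n_embd_v_gqa = 512;   /* print_info: n_embd_v_gqa */
    const size_t bytes_f16    = 2;     /* type_k = type_v = 'f16' */

    size_t k_bytes = n_ctx * n_layer * n_embd_k_gqa * bytes_f16;
    size_t v_bytes = n_ctx * n_layer * n_embd_v_gqa * bytes_f16;

    /* prints K: 2032.00 MiB, V: 2032.00 MiB, total: 4064.00 MiB,
     * matching the llama_kv_cache_init line in the log */
    printf("K: %.2f MiB, V: %.2f MiB, total: %.2f MiB\n",
           k_bytes / (1024.0 * 1024.0),
           v_bytes / (1024.0 * 1024.0),
           (k_bytes + v_bytes) / (1024.0 * 1024.0));
    return 0;
}
```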

First Bad Commit

No response

Relevant log output

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Iris(R) Xe Graphics)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz)
load_backend: failed to find ggml_backend_init in .\ggml-vulkan.dll
load_backend: failed to find ggml_backend_init in .\ggml-cpu.dll
build: 4517 (9f7add1c) with MSVC 19.42.34436.0 for  (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Iris(R) Xe Graphics) - 16072 MiB free
llama_model_loader: loaded meta data with 25 key-value pairs and 387 tensors from C:\Users\BNPR\Downloads\CodeQwen1.5-7B-Chat.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 92416
llama_model_loader: - kv   3:                       llama.context_length u32              = 65536
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 13440
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,92416]   = ["<unk>", "<s>", "<|endoftext|>", "<|...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,92416]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,92416]   = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 4
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 92298
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q5_0:   16 tensors
llama_model_loader: - type q8_0:   16 tensors
llama_model_loader: - type q4_K:  177 tensors
llama_model_loader: - type q6_K:   17 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.41 GiB (5.23 BPW)
load: special tokens cache size = 151
load: token to piece cache size = 0.4983 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 65536
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 13440
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 65536
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
ggml_vulkan: Compiling shaders
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 7.25 B
print_info: general.name     = .
print_info: vocab type       = SPM
print_info: n_vocab          = 92416
print_info: n_merges         = 0
print_info: BOS token        = 2 '<|endoftext|>'
print_info: EOS token        = 4 '<|im_end|>'
print_info: EOT token        = 2 '<|endoftext|>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 92298 '<fim_pad>'
print_info: LF token         = 1396 '<0x0A>'
print_info: EOG token        = 2 '<|endoftext|>'
print_info: EOG token        = 4 '<|im_end|>'
print_info: max token length = 36
............................................Done!
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:      Vulkan0 model buffer size =  4314.02 MiB
load_tensors:   CPU_Mapped model buffer size =   203.06 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 65024
llama_init_from_model: n_ctx_per_seq = 65024
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (65024) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 65024, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init:    Vulkan0 KV buffer size =  4064.00 MiB
llama_init_from_model: KV self size  = 4064.00 MiB, K (f16): 2032.00 MiB, V (f16): 2032.00 MiB
llama_init_from_model: Vulkan_Host  output buffer size =     0.35 MiB
ggml_vulkan: Device memory allocation of size 4428140544 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4428140544
