Description
Name and Version
version: 4517 (9f7add1)
built with MSVC 19.42.34436.0 for
Operating systems
Windows
GGML backends
Vulkan
Hardware
Dell Latitude 5420
Windows 10 Enterprise
CPU: 11th Gen Intel i7-1185G7 @ 3.00GHz, 4 Cores, 8 Logical Processors x86_64
RAM: 2x16GB Hynix 3200MHz DDR4 PC4-25600
GPU: Intel Iris Xe iGPU
Storage: Western Digital PC SN530 NVMe WDC 512GB M.2 SSD
Models
tensorblock/CodeQwen1.5-7B-GGUF
Problem description & steps to reproduce
I get an out-of-memory error when trying to use CodeQwen 1.5 7B Q4_K_M with large context sizes (e.g. 65000). Both the model and KV cache buffers allocate just fine, but a further allocation made during ggml_gallocr_reserve_n fails. More specifically, ggml_gallocr_alloc_graph_impl computes galloc->buf_tallocs[0]->max_size, and that value is then used in line 766 to allocate a new buffer. Because galloc->buf_tallocs[0]->max_size is never checked against the device's maximum allowed buffer size, the allocation fails: in my case galloc->buf_tallocs[0]->max_size evaluates to 4428140544, whereas ggml_backend_buft_get_max_size yields 4294901760.
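To make the mismatch concrete, here is a small standalone sketch of the check that appears to be missing before the reallocation. fits_in_single_buffer is a hypothetical helper for illustration only; the two byte counts and the ggml identifiers in the comments come from my debugging session, and the rest is paraphrased rather than copied from ggml-alloc.c:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical helper: report whether a single backend buffer of new_size bytes
// can be allocated at all, given the limit reported by the buffer type.
static bool fits_in_single_buffer(size_t new_size, size_t max_size) {
    if (new_size > max_size) {
        fprintf(stderr,
                "compute buffer of %zu bytes exceeds device max buffer size of %zu bytes\n",
                new_size, max_size);
        return false; // the caller would have to split the buffer or fail with a clear message
    }
    return true;
}

int main(void) {
    // Values observed while stepping through ggml_gallocr_reserve_n:
    size_t new_size = (size_t) 4428140544ULL; // galloc->buf_tallocs[0]->max_size
    size_t max_size = (size_t) 4294901760ULL; // ggml_backend_buft_get_max_size(...) on Intel Iris Xe
    fits_in_single_buffer(new_size, max_size); // request is ~127 MiB over the limit
    return 0;
}
```

As far as I can tell, the oversized size is currently passed straight to the backend buffer allocation without any such check, which is where the Vulkan backend then reports ErrorOutOfDeviceMemory (see the log below).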
It seems to me that this is a bug in llama.cpp because running Llama 3.2 3B Instruct Q4_0 with the same (and even larger) context sizes works without problems.
First Bad Commit
No response
Relevant log output
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Iris(R) Xe Graphics)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz)
load_backend: failed to find ggml_backend_init in .\ggml-vulkan.dll
load_backend: failed to find ggml_backend_init in .\ggml-cpu.dll
build: 4517 (9f7add1c) with MSVC 19.42.34436.0 for (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Iris(R) Xe Graphics) - 16072 MiB free
llama_model_loader: loaded meta data with 25 key-value pairs and 387 tensors from C:\Users\BNPR\Downloads\CodeQwen1.5-7B-Chat.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = .
llama_model_loader: - kv 2: llama.vocab_size u32 = 92416
llama_model_loader: - kv 3: llama.context_length u32 = 65536
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 13440
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,92416] = ["<unk>", "<s>", "<|endoftext|>", "<|...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,92416] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,92416] = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 4
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 92298
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 23: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 24: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_0: 16 tensors
llama_model_loader: - type q8_0: 16 tensors
llama_model_loader: - type q4_K: 177 tensors
llama_model_loader: - type q6_K: 17 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.41 GiB (5.23 BPW)
load: special tokens cache size = 151
load: token to piece cache size = 0.4983 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 65536
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 13440
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 65536
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
ggml_vulkan: Compiling shadersprint_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 7.25 B
print_info: general.name = .
print_info: vocab type = SPM
print_info: n_vocab = 92416
.print_info: n_merges = 0
print_info: BOS token = 2 '<|endoftext|>'
print_info: EOS token = 4 '<|im_end|>'
print_info: EOT token = 2 '<|endoftext|>'
print_info: UNK token = 0 '<unk>'
print_info: PAD token = 92298 '<fim_pad>'
print_info: LF token = 1396 '<0x0A>'
print_info: EOG token = 2 '<|endoftext|>'
print_info: EOG token = 4 '<|im_end|>'
print_info: max token length = 36
............................................Done!
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: Vulkan0 model buffer size = 4314.02 MiB
load_tensors: CPU_Mapped model buffer size = 203.06 MiB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 65024
llama_init_from_model: n_ctx_per_seq = 65024
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (65024) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 65024, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 4064.00 MiB
llama_init_from_model: KV self size = 4064.00 MiB, K (f16): 2032.00 MiB, V (f16): 2032.00 MiB
llama_init_from_model: Vulkan_Host output buffer size = 0.35 MiB
ggml_vulkan: Device memory allocation of size 4428140544 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 4428140544