
Conversation

slaren (Member) commented on Feb 21, 2024

Apply the same solution as with falcon to allow offloading the output tensor.

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | test | t/s master | t/s PR | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemma 7B Q4_K - Medium | 4.77 GiB | 8.54 B | CUDA | 99 | pp 512 | 1996.45 ± 206.39 | 3959.67 ± 110.36 | 1.98 |
| gemma 7B Q4_K - Medium | 4.77 GiB | 8.54 B | CUDA | 99 | tg 128 | 60.60 ± 0.26 | 110.51 ± 0.30 | 1.82 |

@JohannesGaessler for some reason, compare-llama-bench.py does not work in this case. Do you know what the reason may be?

Nvm, I think this is because this change caused the reported model size/params count to change.

Traceback (most recent call last):
  File "/home/diego/code/llama.cpp/scripts/compare-llama-bench.py", line 305, in <module>
    gpu_blas = bool(rows_full[0][KEY_PROPERTIES.index("gpu_blas")])
IndexError: list index out of range
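
(As an aside: the traceback itself is just an unguarded index into an empty result set. A minimal guard sketch for that line, hypothetical rather than the script's actual code, assuming `rows_full` and `KEY_PROPERTIES` as defined in compare-llama-bench.py:)

```python
import sys

# Hypothetical hardening of the failing line: rows_full comes back empty
# when the comparison query matches nothing, so rows_full[0] raises
# IndexError. Exiting with a message makes the failure mode obvious.
if not rows_full:
    sys.exit("No rows match between the two builds; "
             "did a join property such as model_size change?")
gpu_blas = bool(rows_full[0][KEY_PROPERTIES.index("gpu_blas")])
```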

This query seems to return 0 rows:

SELECT tb.cpu_info, tb.gpu_info, tb.n_gpu_layers, tb.main_gpu, tb.cuda, tb.opencl, tb.metal, tb.gpu_blas, tb.blas, tb.model_filename, tb.model_type, tb.model_size, tb.model_n_params, tb.n_batch, tb.n_threads, tb.type_k, tb.type_v, tb.no_kv_offload, tb.mul_mat_q, tb.tensor_split, tb.n_prompt, tb.n_gen, tb.n_prompt, tb.n_gen, AVG(tb.avg_ts), AVG(tc.avg_ts) FROM test tb JOIN test tc ON tb.cpu_info = tc.cpu_info AND tb.gpu_info = tc.gpu_info AND tb.n_gpu_layers = tc.n_gpu_layers AND tb.main_gpu = tc.main_gpu AND tb.cuda = tc.cuda AND tb.opencl = tc.opencl AND tb.metal = tc.metal AND tb.gpu_blas = tc.gpu_blas AND tb.blas = tc.blas AND tb.model_filename = tc.model_filename AND tb.model_type = tc.model_type AND tb.model_size = tc.model_size AND tb.model_n_params = tc.model_n_params AND tb.n_batch = tc.n_batch AND tb.n_threads = tc.n_threads AND tb.type_k = tc.type_k AND tb.type_v = tc.type_v AND tb.no_kv_offload = tc.no_kv_offload AND tb.mul_mat_q = tc.mul_mat_q AND tb.tensor_split = tc.tensor_split AND tb.n_prompt = tc.n_prompt AND tb.n_gen = tc.n_gen AND tb.build_commit = '89febfed' AND tc.build_commit = '22ca4ddb' GROUP BY tb.cpu_info, tb.gpu_info, tb.n_gpu_layers, tb.main_gpu, tb.cuda, tb.opencl, tb.metal, tb.gpu_blas, tb.blas, tb.model_filename, tb.model_type, tb.model_size, tb.model_n_params, tb.n_batch, tb.n_threads, tb.type_k, tb.type_v, tb.no_kv_offload, tb.mul_mat_q, tb.tensor_split, tb.n_prompt, tb.n_gen, tb.n_gen, tb.n_prompt ORDER BY tb.cpu_info, tb.gpu_info, tb.n_gpu_layers, tb.main_gpu, tb.cuda, tb.opencl, tb.metal, tb.gpu_blas, tb.blas, tb.model_filename, tb.model_type, tb.model_size, tb.model_n_params, tb.n_batch, tb.n_threads, tb.type_k, tb.type_v, tb.no_kv_offload, tb.mul_mat_q, tb.tensor_split, tb.n_prompt, tb.n_gen, tb.n_gen, tb.n_prompt;

This is the data in the tables:

sqlite> select * from test;
89febfed|2230|1|0|0|0|0|0|1|1|13th Gen Intel(R) Core(TM) i9-13900K|NVIDIA GeForce RTX 3090 Ti|models/gemma-7b-it-Q4_K_M.gguf|gemma 7B Q4_K - Medium|5121183744|8538074112|512|16|f16|f16|99|layer|0|0|1|0.00|1|512|0|2024-02-21T20:40:28Z|268801109|48015341|1944.929686|281.264199
89febfed|2230|1|0|0|0|0|0|1|1|13th Gen Intel(R) Core(TM) i9-13900K|NVIDIA GeForce RTX 3090 Ti|models/gemma-7b-it-Q4_K_M.gguf|gemma 7B Q4_K - Medium|5121183744|8538074112|512|16|f16|f16|99|layer|0|0|1|0.00|1|0|128|2024-02-21T20:40:30Z|2103439045|12619252|60.854476|0.364524
22ca4ddb|2231|1|0|0|0|0|0|1|1|13th Gen Intel(R) Core(TM) i9-13900K|NVIDIA GeForce RTX 3090 Ti|models/gemma-7b-it-Q4_K_M.gguf|gemma 7B Q4_K - Medium|5563772928|9324899328|512|16|f16|f16|99|layer|0|0|1|0.00|1|512|0|2024-02-21T20:40:42Z|128998178|2139241|3969.902909|64.418657
22ca4ddb|2231|1|0|0|0|0|0|1|1|13th Gen Intel(R) Core(TM) i9-13900K|NVIDIA GeForce RTX 3090 Ti|models/gemma-7b-it-Q4_K_M.gguf|gemma 7B Q4_K - Medium|5563772928|9324899328|512|16|f16|f16|99|layer|0|0|1|0.00|1|0|128|2024-02-21T20:40:43Z|1162848942|3929889|110.075492|0.371931
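
The rows confirm it: model_size (5121183744 vs 5563772928) and model_n_params (8538074112 vs 9324899328) differ between the two builds, so the self-join on those columns matches nothing. A small diagnostic sketch to spot which join keys diverge, assuming the dump is saved as llama-bench.sqlite (hypothetical filename) with the schema shown above:

```python
import sqlite3

# Join-key columns from the comparison query above (subset for brevity).
PROPS = ["cpu_info", "gpu_info", "n_gpu_layers", "model_filename",
         "model_type", "model_size", "model_n_params"]

con = sqlite3.connect("llama-bench.sqlite")
cur = con.cursor()

def props_for(commit: str) -> dict:
    # All rows for one build share the same join properties,
    # so looking at a single row is enough.
    cols = ", ".join(PROPS)
    row = cur.execute(
        f"SELECT {cols} FROM test WHERE build_commit = ? LIMIT 1;",
        (commit,)).fetchone()
    return dict(zip(PROPS, row))

a = props_for("89febfed")  # master
b = props_for("22ca4ddb")  # PR
for key in PROPS:
    if a[key] != b[key]:
        print(f"{key}: {a[key]} != {b[key]}")

# Expected output given the rows above:
#   model_size: 5121183744 != 5563772928
#   model_n_params: 8538074112 != 9324899328
```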

slaren merged commit ba2135c into master on Feb 21, 2024
slaren deleted the sl/gemma-offload-output branch on February 21, 2024 at 21:18
cebtenzzre pushed a commit to nomic-ai/llama.cpp that referenced this pull request Feb 21, 2024
cebtenzzre added a commit to nomic-ai/gpt4all that referenced this pull request Feb 21, 2024
cebtenzzre (Collaborator) commented:

This change breaks CPU inference of the Q4_0 quant, with or without #5650:

llm_load_tensors: ggml ctx size =    0.10 MiB
ggml_new_object: not enough space in the context's memory pool (needed 101968, available 101600)
[1]    60495 segmentation fault (core dumped)
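
A bit of arithmetic on that log, under the assumption (not confirmed in this thread) that the newly duplicated output tensor is the unaccounted-for object:

```python
# Shortfall reported by ggml_new_object.
needed, available = 101_968, 101_600
print(needed - available)  # 368 bytes: on the order of one extra ggml
                           # object header plus tensor struct that the
                           # context size estimate did not include
```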

slaren (Member, Author) commented on Feb 21, 2024

Should be fixed in #5651

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024