gemma : allow offloading the output tensor #5646
Merged
Apply the same solution as with falcon to allow offloading the output tensor.
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
@JohannesGaessler for some reason, compare-llama-bench.py does not work in this case. Do you know what the reason may be?
Nvm, I think this is because this change caused the reported model size/params count to change.
This query seems to return 0 rows:
```sql
SELECT
  tb.cpu_info, tb.gpu_info, tb.n_gpu_layers, tb.main_gpu, tb.cuda, tb.opencl,
  tb.metal, tb.gpu_blas, tb.blas, tb.model_filename, tb.model_type,
  tb.model_size, tb.model_n_params, tb.n_batch, tb.n_threads, tb.type_k,
  tb.type_v, tb.no_kv_offload, tb.mul_mat_q, tb.tensor_split,
  tb.n_prompt, tb.n_gen, tb.n_prompt, tb.n_gen,
  AVG(tb.avg_ts), AVG(tc.avg_ts)
FROM test tb
JOIN test tc ON
  tb.cpu_info = tc.cpu_info AND tb.gpu_info = tc.gpu_info AND
  tb.n_gpu_layers = tc.n_gpu_layers AND tb.main_gpu = tc.main_gpu AND
  tb.cuda = tc.cuda AND tb.opencl = tc.opencl AND tb.metal = tc.metal AND
  tb.gpu_blas = tc.gpu_blas AND tb.blas = tc.blas AND
  tb.model_filename = tc.model_filename AND tb.model_type = tc.model_type AND
  tb.model_size = tc.model_size AND tb.model_n_params = tc.model_n_params AND
  tb.n_batch = tc.n_batch AND tb.n_threads = tc.n_threads AND
  tb.type_k = tc.type_k AND tb.type_v = tc.type_v AND
  tb.no_kv_offload = tc.no_kv_offload AND tb.mul_mat_q = tc.mul_mat_q AND
  tb.tensor_split = tc.tensor_split AND
  tb.n_prompt = tc.n_prompt AND tb.n_gen = tc.n_gen AND
  tb.build_commit = '89febfed' AND tc.build_commit = '22ca4ddb'
GROUP BY
  tb.cpu_info, tb.gpu_info, tb.n_gpu_layers, tb.main_gpu, tb.cuda, tb.opencl,
  tb.metal, tb.gpu_blas, tb.blas, tb.model_filename, tb.model_type,
  tb.model_size, tb.model_n_params, tb.n_batch, tb.n_threads, tb.type_k,
  tb.type_v, tb.no_kv_offload, tb.mul_mat_q, tb.tensor_split,
  tb.n_prompt, tb.n_gen, tb.n_gen, tb.n_prompt
ORDER BY
  tb.cpu_info, tb.gpu_info, tb.n_gpu_layers, tb.main_gpu, tb.cuda, tb.opencl,
  tb.metal, tb.gpu_blas, tb.blas, tb.model_filename, tb.model_type,
  tb.model_size, tb.model_n_params, tb.n_batch, tb.n_threads, tb.type_k,
  tb.type_v, tb.no_kv_offload, tb.mul_mat_q, tb.tensor_split,
  tb.n_prompt, tb.n_gen, tb.n_gen, tb.n_prompt;
```
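The zero-row result can be reproduced in miniature: the query self-joins on every identifying column, including `model_size` and `model_n_params`, so if the two builds report different values for those columns, no pair of rows matches. A minimal sketch with `sqlite3` (the table and column subset here are illustrative, not the script's full schema):

```python
import sqlite3

# In-memory stand-in for the llama-bench results database.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE test (build_commit TEXT, model_size INTEGER, avg_ts REAL)"
)

# The baseline build and the new build report different model sizes
# (hypothetical numbers; the point is only that they differ).
con.execute("INSERT INTO test VALUES ('89febfed', 5000000000, 100.0)")
con.execute("INSERT INTO test VALUES ('22ca4ddb', 4900000000, 110.0)")

# Self-join on model_size, as compare-llama-bench.py joins on all
# identifying columns: the mismatch leaves nothing to compare.
rows = con.execute(
    "SELECT tb.avg_ts, tc.avg_ts FROM test tb JOIN test tc "
    "ON tb.model_size = tc.model_size "
    "WHERE tb.build_commit = '89febfed' AND tc.build_commit = '22ca4ddb'"
).fetchall()
print(len(rows))  # 0 -- no row pairs survive the join
```

Dropping `model_size`/`model_n_params` from the join keys (or re-running the baseline with the new build) would restore matching rows.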
This is the data in the tables: