k_quants tuning for Falcon-7b #2816
Conversation
JohannesGaessler left a comment:
Sorry for causing more work for you; I thought I had checked `QK_K = 64`, but it seems I forgot. I would have fixed it myself, but I haven't worked on llama.cpp in the last few days.
Using `LLAMA_CUDA_FORCE_DMMV=ON` and `-nommq`, it runs and produces a meaningful result.
Force-pushed from f547c58 to 061f777.
Keep in mind that `mul_mat_q` reduces VRAM usage and thus allows you to run a better quantization, though. So I would argue that with the same hardware you can still achieve better perplexity.

The overwhelming majority of users are running LLaMA-based models and I think the defaults should reflect that. So I think `mul_mat_q` should remain the default.
I just remembered: the […]
This is highly likely to be causing problems. On Metal, building with fast math enabled (the compiler default) allows math functions to be replaced with faster, less accurate approximations; see the Metal Shading Language Specification: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf. One can explicitly use "precise" math functions by calling the variants in the `precise::` namespace. Simply changing the kernel to use `precise::` functions […]
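To illustrate the point about precise math functions, here is a minimal MSL sketch. The kernel is hypothetical (not the actual ggml-metal kernel touched in this discussion); it only shows the difference between relying on the default math functions and explicitly calling the `precise::` variants.

```metal
#include <metal_stdlib>
using namespace metal;

// With fast math (the Metal compiler default), plain exp() may be lowered to a
// faster, less accurate approximation.
kernel void silu_default(device const float * src [[buffer(0)]],
                         device       float * dst [[buffer(1)]],
                         uint tpig [[thread_position_in_grid]]) {
    const float x = src[tpig];
    dst[tpig] = x / (1.0f + exp(-x));
}

// The same kernel, explicitly requesting the precise variant so the result
// does not depend on the fast-math build setting.
kernel void silu_precise(device const float * src [[buffer(0)]],
                         device       float * dst [[buffer(1)]],
                         uint tpig [[thread_position_in_grid]]) {
    const float x = src[tpig];
    dst[tpig] = x / (1.0f + precise::exp(-x));
}
```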
I'm not sure what […]
* Make ggml-cuda.cu build with QK_K = 64

  Using LLAMA_CUDA_FORCE_DMMV=ON and -nommq it runs and produces a meaningful result.

* k_quants tuning for Falcon-7b

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
```diff
         int nx = tensor->ne[0];
-        if (nx % QK_K == 0) {
+        if (model.arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
             new_type = GGML_TYPE_Q8_0;
```
Why don't we use `Q8_0` when `GGML_USE_K_QUANTS` is disabled?
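For context, here is a hedged sketch of how this per-tensor override could be factored out. The helper name is made up for illustration, and the code assumes llama.cpp's internal `llm_arch` and `ggml_tensor` types plus the `QK_K` constant; it is not the exact upstream code.

```cpp
// Illustrative helper (hypothetical name), assuming llama.cpp internals.
// Note: at the time of this PR, the surrounding selection logic is compiled
// only when GGML_USE_K_QUANTS is defined.
static ggml_type pick_output_weight_type(llm_arch arch, const ggml_tensor * tensor, ggml_type new_type) {
    const int nx = tensor->ne[0];
    if (arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
        // With the default QK_K = 256, Falcon's output.weight row size (4544)
        // is not a multiple of QK_K, and Q8_0 gives a large perplexity
        // improvement there (see the PR description below).
        return GGML_TYPE_Q8_0;
    }
    if (new_type != GGML_TYPE_Q8_0) {
        // Otherwise keep the usual high-precision choice for output.weight.
        return GGML_TYPE_Q6_K;
    }
    return new_type;
}
```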


Falcon-7b requires using k-quants super-blocks of `QK_K = 64` instead of the usual `QK_K = 256` (`LLAMA_QKK_64=ON` when building). This PR:

* Makes `ggml-cuda.cu` build with `QK_K = 64`. One still needs to build with `LLAMA_CUDA_FORCE_DMMV=ON` and run with `-nommq` for it to produce meaningful results (CUDA does not build when QK_K = 64 #2815). There are also many warnings when compiling `ggml-cuda.cu`.
* Tunes the k-quants mixes for Falcon-7b for `QK_K = 256` and `QK_K = 64`.
* Uses `Q8_0` quantization of the `output.weight` tensor for Falcon models for all quantization types. This makes a huge difference for `Q4/5_0/1`. For instance, `Q4_0` perplexity becomes 7.2451 from 8.3948 without the changes in this PR! For `Q5_0` the change is from 7.4725 to 7.1605 (Falcon-7b perplexity for `fp16` is 7.1213).

Some observations:

* Quantizations at or below `Q3_K_M` are not really viable for Falcon-7b.
* `Q4/5_0/1` are highly competitive with the k_quants when the `output.weight` tensor is quantized with `Q8_0`.
* The perplexity difference between the cuBLAS-based matrix multiplications (`-nommq`) and the quantized implementation is much bigger compared to the LLaMA models. For instance, for `Q4_0`, `-nommq` is 0.031 lower, which I think is not acceptable. In comparison, for LLaMA-v2-7B the difference is 0.006 (which is also quite big for my taste, but borderline acceptable). Perhaps we should consider reverting CUDA: use mul_mat_q kernels by default #2683 so quantized matrix multiplications are opt-in rather than the default?

The following graph shows perplexity scores for Falcon-7B for different quantization types using this PR. All calculations were run with `-nommq`.
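To make the `QK_K = 64` requirement above concrete, here is a small standalone check. The hidden size of 4544 for Falcon-7b is an assumption taken from the model's published configuration.

```cpp
#include <cstdio>

// Falcon-7b's hidden size (assumed 4544) is not divisible by the default
// k-quants super-block size of 256, but it is divisible by 64, hence the
// need for the QK_K = 64 (LLAMA_QKK_64=ON) build.
int main() {
    const int n_embd = 4544;
    std::printf("4544 %% 256 = %d\n", n_embd % 256); // 192: QK_K = 256 does not divide the row size
    std::printf("4544 %%  64 = %d\n", n_embd % 64);  //   0: QK_K = 64 does
    return 0;
}
```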