
Conversation

dmahurin (Contributor)

Support for TQ2_0 on Metal.

This is a commit by @compilade from last year, re-applied to the current codebase.

Run with:

llama-cli -m "$(huggingface-cli download basavyr/TriLM_3.9B_Unpacked_quantized TriLM_3.9B_Unpacked_quant_TQ2_0.gguf)" -p The

llama-cli -m "$(huggingface-cli download brunopio/Llama3-8B-1.58-100B-tokens-GGUF Llama3-8B-1.58-100B-tokens-TQ2_0.gguf)"

It runs, and the output seems similar to that of the original commit by @compilade.

The quality is not great compared to 4-bit Llama 8B, though. Perhaps someone can compare with the non-Metal result.

compilade and others added 2 commits March 20, 2025 14:24
Mostly adapted from the IQ2_TN kernels
from ikawrakow/ik_llama.cpp#13
which were themselves adapted from the Q2_K kernels.
@ggerganov (Member)

A few updates, made just by pattern matching against the Q2_K kernel:

diff --git a/ggml/src/ggml-metal/ggml-metal.metal b/ggml/src/ggml-metal/ggml-metal.metal
index 8ac60744..a068e84c 100644
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -5075,15 +5075,15 @@ void kernel_mul_mv_tq2_0_f32_impl(
     const int im = tgpig.z;
 
     const int first_row = (r0 * N_SIMDGROUP + sgitg) * N_DST;
-    const int ib_row = first_row * nb;
 
     const uint i12 = im%args.ne12;
     const uint i13 = im/args.ne12;
 
-    const uint offset0 = (i12/args.r2)*(nb*args.ne01) + (i13/args.r3)*(nb*args.ne01*args.ne02);
+    const uint64_t offset0 = first_row*args.nb01 + (i12/args.r2)*args.nb02 + (i13/args.r3)*args.nb03;
+    const uint64_t offset1 =        r1*args.nb11 + (i12        )*args.nb12 + (i13        )*args.nb13;
 
-    device const block_tq2_0 * x = (device const block_tq2_0 *) src0 + ib_row + offset0;
-    device const float       * y = (device const float       *) src1 + r1*args.ne10 + im*args.ne00*args.ne1;
+    device const block_tq2_0 * x = (device const block_tq2_0 *) (src0 + offset0);
+    device const float       * y = (device const float       *) (src1 + offset1);
 
     float yl[32];
     float sumf[N_DST]={0.f}, all_sum;
@@ -5139,7 +5139,7 @@ void kernel_mul_mv_tq2_0_f32_impl(
 
     device float * dst_f32 = (device float *) dst + (uint64_t)im*args.ne0*args.ne1 + (uint64_t)r1*args.ne0;
 
-    for (int row = 0; row < N_DST; ++row) {
+    for (int row = 0; row < N_DST && first_row + row < args.ne0; ++row) {
         all_sum = simd_sum(sumf[row]);
         if (tiisg == 0) {
             dst_f32[first_row + row] = all_sum;

I haven't performed any tests, so double-check whether this makes sense.
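
To see why the added bounds check matters: each simdgroup accumulates N_DST output rows starting at first_row, and when the number of output rows is not a multiple of the rows covered per threadgroup, the last simdgroups start past the end of dst. Below is a minimal host-side C++ sketch of the same guard; N_DST and ne0 are made-up stand-in values, not the kernel's actual launch parameters.

#include <cstdio>
#include <vector>

int main() {
    const int N_DST = 4;   // rows produced per simdgroup (stand-in value)
    const int ne0   = 10;  // total output rows; deliberately not a multiple of N_DST

    std::vector<float> dst(ne0, 0.0f);

    for (int first_row = 0; first_row < ne0; first_row += N_DST) {
        for (int row = 0; row < N_DST && first_row + row < ne0; ++row) {
            // without the "first_row + row < ne0" guard, the last group
            // (first_row == 8) would write dst[10] and dst[11], past the end
            dst[first_row + row] = 1.0f;
        }
    }

    printf("wrote %d rows without going out of bounds\n", ne0);
    return 0;
}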

@ikawrakow (Contributor)

This is a commit by @compilade from last year, re-applied to the current codebase.

The Metal implementation actually came from ik_llama.cpp (ikawrakow/ik_llama.cpp#13); see this comment by @compilade.

@dmahurin (Contributor, Author) commented Mar 30, 2025

Hi @ggerganov

I updated the branch with your changes.
They compile and run; I could not quite tell whether the result is any worse or equivalent.

While I understand replacing ib_row with first_row * nb, and the bounds check,
there are other changes that are less obvious (to me) that you could perhaps explain.

-    const uint64_t offset0 = first_row * nb + (i12/args.r2)*(nb*args.ne01) + (i13/args.r3)*(nb*args.ne01*args.ne02);
-    const uint64_t offset1 = r1*args.ne10 + im*args.ne00*args.ne1;
+    const uint64_t offset0 = first_row*args.nb01 + (i12/args.r2)*args.nb02 + (i13/args.r3)*args.nb03;
+    const uint64_t offset1 =        r1*args.nb11 + (i12        )*args.nb12 + (i13        )*args.nb13;
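
For what it's worth, the two forms of offset0 should address the same block whenever src0 is contiguous: args.nb01/nb02/nb03 are byte strides, src0 is a byte pointer, and for a contiguous row of nb blocks nb01 equals nb*sizeof(block_tq2_0), so first_row*args.nb01 in bytes lands on the same block that first_row*nb reached in block units. The stride form additionally handles non-contiguous views, and the same reasoning applies to offset1 with the byte strides nb11/nb12/nb13 of the float src1. A rough standalone C++ sketch of that equivalence (the struct layout and sizes below are stand-ins, not ggml's real definitions):

#include <cassert>
#include <cstdint>
#include <cstdio>

// Stand-in block type; the real block_tq2_0 lives in ggml and has a different layout.
struct block_tq2_0 { uint8_t qs[64]; uint16_t d; };

int main() {
    const int nb   = 12;                                 // blocks per row (ne00 / block size), made up
    const int rows = 16;                                 // number of rows, made up

    static block_tq2_0 buf[16 * 12];                     // a contiguous rows x nb block matrix
    const char    *src0 = (const char *) buf;            // byte pointer, as in the kernel
    const uint64_t nb01 = nb * sizeof(block_tq2_0);      // byte stride between consecutive rows

    for (int first_row = 0; first_row < rows; ++first_row) {
        const block_tq2_0 *x_old = (const block_tq2_0 *) src0 + first_row * nb;      // old: block index
        const block_tq2_0 *x_new = (const block_tq2_0 *) (src0 + first_row * nb01);  // new: byte offset
        assert(x_old == x_new);  // identical addresses for contiguous rows
    }

    printf("block-index and byte-stride addressing agree for contiguous rows\n");
    return 0;
}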
