Conversation

ggerganov (Member)

Target: #16148

Similar optimization to the one in #14924:

Before running the FA kernel, run a quick pass over the mask to find the blocks that are entirely -INF and mark them in a fleeting buffer. The FA kernel then checks that buffer to determine whether it needs to process a given block.

Also unroll some loops better.

Most gains observed for larger head sizes and bigger contexts.
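For intuition, here is a minimal host-side sketch of the mask pre-pass. The function name `mark_active_blocks`, the 16x128 tile size, and the plain row-major float layout are illustrative assumptions only; the actual implementation is a Metal kernel that writes per-block flags into a fleeting buffer consumed by the FA kernel.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// mask   : row-major [n_q, n_kv] attention mask, -INFINITY marks masked positions
// result : one flag per (block_q x block_kv) tile of the mask
//          1 = the tile has at least one unmasked entry and must be processed
//          0 = the tile is entirely -INF and the FA kernel can skip it
static std::vector<uint8_t> mark_active_blocks(
        const std::vector<float> & mask, int n_q, int n_kv,
        int block_q = 16, int block_kv = 128) {
    const int nbq  = (n_q  + block_q  - 1)/block_q;
    const int nbkv = (n_kv + block_kv - 1)/block_kv;

    std::vector<uint8_t> active(nbq*nbkv, 0);

    for (int bq = 0; bq < nbq; bq++) {
        for (int bkv = 0; bkv < nbkv; bkv++) {
            bool any = false;

            // scan the tile until the first unmasked entry is found
            for (int q = bq*block_q; q < std::min((bq + 1)*block_q, n_q) && !any; q++) {
                for (int kv = bkv*block_kv; kv < std::min((bkv + 1)*block_kv, n_kv); kv++) {
                    if (mask[(size_t) q*n_kv + kv] != -INFINITY) {
                        any = true;
                        break;
                    }
                }
            }

            active[bq*nbkv + bkv] = any ? 1 : 0;
        }
    }

    return active;
}
```

With such a mapping, the FA kernel only has to test one flag per KV tile and can skip fully masked tiles outright, which pays off when large regions of the mask are -INF (e.g. with sliding-window attention at long contexts).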

| Model | Test | t/s master | t/s gg/metal-fa-opt | Speedup |
| --- | --- | ---: | ---: | ---: |
| gemma3 1B Q4_0 | pp512 | 11103.89 | 11193.33 | 1.01 |
| gemma3 1B Q4_0 | pp2048 | 11458.31 | 11591.20 | 1.01 |
| gemma3 1B Q4_0 | pp4096 | 11646.47 | 11786.38 | 1.01 |
| gemma3 1B Q4_0 | pp8192 | 11424.17 | 11586.18 | 1.01 |
| gemma3 1B Q4_0 | pp16384 | 10162.08 | 10928.36 | 1.08 |
| gemma3 270M Q4_0 | pp512 | 37852.83 | 38711.10 | 1.02 |
| gemma3 270M Q4_0 | pp2048 | 41370.85 | 42241.54 | 1.02 |
| gemma3 270M Q4_0 | pp4096 | 43306.58 | 44600.97 | 1.03 |
| gemma3 270M Q4_0 | pp8192 | 40109.63 | 41961.18 | 1.05 |
| gemma3 270M Q4_0 | pp16384 | 34145.22 | 36322.50 | 1.06 |
| gemma3 4B Q4_0 | pp512 | 2795.98 | 2797.77 | 1.00 |
| gemma3 4B Q4_0 | pp2048 | 2659.50 | 2969.56 | 1.12 |
| gemma3 4B Q4_0 | pp4096 | 2551.27 | 2950.20 | 1.16 |
| gemma3 4B Q4_0 | pp8192 | 2516.94 | 2900.49 | 1.15 |
| gemma3 4B Q4_0 | pp16384 | 2481.45 | 2777.20 | 1.12 |
| gpt-oss 20B MXFP4 MoE | pp512 | 2429.89 | 2439.77 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2764.66 | 2803.28 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp4096 | 2674.08 | 2731.79 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp8192 | 2480.89 | 2563.84 | 1.03 |
| gpt-oss 20B MXFP4 MoE | pp16384 | 2150.29 | 2259.51 | 1.05 |
| qwen2 3B Q4_0 | pp512 | 3206.65 | 3203.57 | 1.00 |
| qwen2 3B Q4_0 | pp2048 | 3356.57 | 3365.75 | 1.00 |
| qwen2 3B Q4_0 | pp4096 | 3093.21 | 3174.24 | 1.03 |
| qwen2 3B Q4_0 | pp8192 | 2711.37 | 2821.38 | 1.04 |
| qwen2 3B Q4_0 | pp16384 | 2241.79 | 2276.75 | 1.02 |
| qwen2 7B Q8_0 | pp512 | 1531.82 | 1533.04 | 1.00 |
| qwen2 7B Q8_0 | pp2048 | 1583.26 | 1586.74 | 1.00 |
| qwen2 7B Q8_0 | pp4096 | 1522.09 | 1527.90 | 1.00 |
| qwen2 7B Q8_0 | pp8192 | 1407.64 | 1415.64 | 1.01 |
| qwen2 7B Q8_0 | pp16384 | 1088.01 | 1226.08 | 1.13 |
| qwen3 0.6B Q8_0 | pp512 | 14391.06 | 14314.24 | 0.99 |
| qwen3 0.6B Q8_0 | pp2048 | 13826.28 | 14158.68 | 1.02 |
| qwen3 0.6B Q8_0 | pp4096 | 11400.54 | 11682.70 | 1.02 |
| qwen3 0.6B Q8_0 | pp8192 | 8001.53 | 8544.94 | 1.07 |
| qwen3 0.6B Q8_0 | pp16384 | 5162.65 | 5460.53 | 1.06 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 2194.70 | 2194.62 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | pp2048 | 2507.55 | 2528.96 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | pp4096 | 2225.35 | 2250.60 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | pp8192 | 1808.99 | 1831.51 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | pp16384 | 1300.71 | 1321.72 | 1.02 |

ggerganov requested a review from slaren as a code owner on Oct 1, 2025 at 14:55.
The github-actions bot added the testing, ggml, and Apple Metal labels on Oct 1, 2025.
jeffbolznv (Collaborator)

Is it possible to add backend tests that exercise this optimization?

ggerganov (Member, Author) commented on Oct 1, 2025

This patch should exercise it, but it's currently very slow:

```diff
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 64f1197dc..54e16bf8f 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -131,6 +131,51 @@ static void init_tensor_uniform(ggml_tensor * tensor, float min = -1.0f, float m
     }
 }
 
+static void init_tensor_kq_mask(ggml_tensor * tensor, float min = -1.0f, float max = 1.0f) {
+    GGML_ASSERT(tensor->type == GGML_TYPE_F16);
+
+    GGML_TENSOR_LOCALS( int32_t, ne, tensor, ne);
+    GGML_TENSOR_LOCALS(uint64_t, nb, tensor, nb);
+
+    std::vector<float>       data_f32(ne0*ne1*ne2*ne3);
+    std::vector<ggml_fp16_t> data_f16(ne0*ne1*ne2*ne3);
+
+    std::random_device rd;
+    std::mt19937 gen(rd());
+    std::uniform_real_distribution<float> dis(min, max);
+
+    // fill data_f32 with random floats in [-1.0, 1.0f]
+    for (size_t i = 0; i < data_f32.size(); i++) {
+        data_f32[i] = dis(gen);
+    }
+
+    const int blck_w = 128;
+    const int blck_h = 16;
+
+    // fill roughly half of the mask with -INFINITY
+    const int n_inf_blocks = 0.5*(ne0*ne1*ne2*ne3)/(blck_w*blck_h);
+
+    // choose random block position
+    for (int b = 0; b < n_inf_blocks; b++) {
+        const int i3 = (rd() % ne3);
+        const int i2 = (rd() % ne2);
+        const int i1 = (rd() % ne1);
+        const int i0 = (rd() % ne0);
+
+        for (int y = 0; y < blck_h && i1 + y < ne1; y++) {
+            for (int x = 0; x < blck_w && i0 + x < ne0; x++) {
+                const int i = i3*ne2*ne1*ne0 + i2*ne1*ne0 + (i1 + y)*ne0 + (i0 + x);
+
+                data_f32[i] = -INFINITY;
+            }
+        }
+    }
+
+    ggml_fp32_to_fp16_row(data_f32.data(), data_f16.data(), ne0*ne1*ne2*ne3);
+
+    ggml_backend_tensor_set(tensor, data_f16.data(), 0, data_f16.size()*sizeof(ggml_fp16_t));
+}
+
 static std::vector<float> tensor_to_float(const ggml_tensor * t) {
     std::vector<float> tv;
     tv.reserve(ggml_nelements(t));
@@ -5104,6 +5149,8 @@ struct test_flash_attn_ext : public test_case {
             if (strcmp(t->name, "s") == 0) {
                 // make the sink values more noticable in order to trigger a test failure when the implementation is wrong
                 init_tensor_uniform(t, -10.0f, 10.0f);
+            } else if (strcmp(t->name, "m") == 0) {
+                init_tensor_kq_mask(t);
             } else {
                 init_tensor_uniform(t);
             }
```

I'll try to optimize it tomorrow.

Edit: should be ok now
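As a usage note (hedged, since the exact flags depend on the harness version): on a Metal build, filtering the backend tests to the attention op, e.g. `./build/bin/test-backend-ops test -o FLASH_ATTN_EXT`, should be enough to exercise the new `init_tensor_kq_mask` path.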
