junjihashimoto
Collaborator

@junjihashimoto junjihashimoto commented Sep 5, 2025

Add a subgroup matrix multiplication kernel (kernel version 12).
The kernel runs with the subgroupMatrixMultiplication function, but its results are currently incorrect. I am still debugging it.

> sysctl -n machdep.cpu.brand_string
Apple M4 Max
> MATMUL_VERSION=12 ./build/matmul  | grep -A 2 'Dispatching\|Exec'
matmul(50174,0x2016ee140) malloc: nano zone abandoned due to inability to reserve vm space.
[info] Dispatching Kernel version 12: f16: Subgroup matrix multiply with transpose, 30 iterations ...
[info] Copying result to CPU
[info]
--
Execution Time: (M = 4096, K = 4096, N = 8192) x 30 iterations :
40.0 milliseconds / dispatch ~ 6871.66 GFLOPS
================================================================================
> MATMUL_VERSION=11 ./build/matmul  | grep -A 2 'Dispatching\|Exec'
matmul(13932,0x2016ee140) malloc: nano zone abandoned due to inability to reserve vm space.
[info] Dispatching Kernel version 11: f16: 2D blocktiling with loop unrolling, vectorization and transpose, 30 iterations ...
[info] Copying result to CPU
[info]
--
Execution Time: (M = 4096, K = 4096, N = 8192) x 30 iterations :
26.6 milliseconds / dispatch ~ 10316.68 GFLOPS ## Baseline: this run does not use the subgroupMatrixMultiplication function.
================================================================================
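
For reference, the GFLOPS figures in the logs above can be approximately reproduced from the matrix sizes and the per-dispatch time. The sketch below assumes the usual convention of counting 2 * M * K * N floating-point operations per matrix multiply (one multiply and one add per inner-product term); the helper name `matmul_gflops` is hypothetical and is not part of the benchmark. The values it prints differ slightly from the logged ones, presumably because the logged times are rounded to one decimal place.

```python
def matmul_gflops(m: int, k: int, n: int, seconds_per_dispatch: float) -> float:
    """Throughput in GFLOPS, assuming 2 * m * k * n FLOPs per matmul."""
    flops = 2 * m * k * n  # one multiply + one add per inner-product term
    return flops / seconds_per_dispatch / 1e9


# Sizes and rounded timings taken from the logs above.
M, K, N = 4096, 4096, 8192
subgroup = matmul_gflops(M, K, N, 40.0e-3)  # kernel version 12
tiled = matmul_gflops(M, K, N, 26.6e-3)     # kernel version 11

print(f"version 12 (subgroup matrix): {subgroup:.2f} GFLOPS")
print(f"version 11 (2D blocktiling):  {tiled:.2f} GFLOPS")
print(f"version 11 is {tiled / subgroup:.2f}x faster per dispatch")
```

With these timings, version 11 comes out roughly 1.5x faster than version 12, which matches the ratio of the two logged per-dispatch times.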

@junjihashimoto
Collaborator Author

The main branch does not seem to output any shader compilation errors.

@junjihashimoto junjihashimoto changed the base branch from main to dev September 22, 2025 09:33