
Conversation

JohannesGaessler
Collaborator

@JohannesGaessler JohannesGaessler commented Sep 14, 2025

This PR fixes a bug in the CUDA FlashAttention occupancy calculation. In rare cases too few kernels would be launched in parallel, leading to a few percent lower performance.
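(For context, a minimal sketch of how an occupancy estimate can be obtained via the CUDA runtime; the actual calculation in the llama.cpp host code is more involved, and `fattn_tile_sketch`, the block size, and the shared-memory size below are made-up placeholders.)

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stand-in for a FlashAttention kernel; the real kernels take many more parameters.
__global__ void fattn_tile_sketch(const char * Q, const char * K, const char * V, float * dst) {}

int main() {
    const int    block_size = 256;       // threads per block (assumed)
    const size_t smem_bytes = 32 * 1024; // dynamic shared memory per block (assumed)

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, fattn_tile_sketch, block_size, smem_bytes);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, /*device =*/ 0);

    // If this estimate is too low, fewer blocks than the hardware could run are
    // launched in parallel, which is the kind of inefficiency described above.
    printf("active blocks per SM: %d, SMs: %d\n", blocks_per_sm, prop.multiProcessorCount);
    return 0;
}
```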

This PR also delivers what I expect to be the last round of performance optimizations for the tile FA kernel: I revised the memory layout to consistently copy data in 8/16 byte chunks and delayed writing the KQ accumulators to shared memory until after they have been compressed to FP16. I also looked up the amount of shared memory on each AMD GPU and fitted the tile sizes accordingly. One thing that could still be done is the same GQA optimization as in the mma kernel, but because the GPUs using the tile kernel are comparatively slow, reducing the mask I/O has little impact; it could still improve performance for small batch sizes > 1, though.
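(Illustration only: a minimal sketch of the two techniques described above, with made-up function names and sizes; the actual kernel's layout and types differ.)

```cuda
#include <cuda_fp16.h>

// Copy tile data from global to shared memory in 16-byte chunks by going
// through int4 instead of element-wise loads; nbytes is assumed to be a
// multiple of 16 and both pointers 16-byte aligned.
static __device__ void copy_tile_16B(const char * __restrict__ src, char * __restrict__ dst, const int nbytes) {
    const int4 * src4 = (const int4 *) src;
    int4       * dst4 = (int4       *) dst;
    for (int i = threadIdx.x; i < nbytes/16; i += blockDim.x) {
        dst4[i] = src4[i]; // one 16-byte load + one 16-byte store per iteration
    }
}

// Keep the KQ accumulators in FP32 registers for the softmax arithmetic and
// only convert to FP16 right before the shared-memory write, halving the
// shared-memory traffic and footprint for this buffer.
static __device__ void store_kq_fp16(const float2 * __restrict__ kq_reg, half2 * __restrict__ kq_smem, const int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        kq_smem[i] = __float22half2_rn(kq_reg[i]); // compress two values at a time
    }
}
```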

Performance changes
| GPU | Model | Microbatch size | Test | t/s master | t/s 4a60861 | Speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| MI60 / MI50 | gemma 2B Q4_0 | 16 | pp16384 | 715.37 | 723.22 | 1.01 |
| MI60 / MI50 | gemma 2B Q4_0 | 32 | pp16384 | 911.50 | 922.04 | 1.01 |
| MI60 / MI50 | gemma 2B Q4_0 | 64 | pp16384 | 1002.59 | 1037.99 | 1.04 |
| MI60 / MI50 | gemma 2B Q4_0 | 128 | pp16384 | 1571.26 | 1632.54 | 1.04 |
| MI60 / MI50 | gemma 2B Q4_0 | 256 | pp16384 | 1960.46 | 2104.29 | 1.07 |
| MI60 / MI50 | gemma 2B Q4_0 | 512 | pp16384 | 2137.20 | 2309.96 | 1.08 |
| MI60 / MI50 | gemma 2B Q4_0 | 1024 | pp16384 | 2282.16 | 2504.71 | 1.10 |
| MI60 / MI50 | gemma 2B Q4_0 | 2048 | pp16384 | 2302.98 | 2529.14 | 1.10 |
| MI60 / MI50 | gemma 2B Q4_0 | 4096 | pp16384 | 2264.73 | 2503.34 | 1.11 |
| MI60 / MI50 | gemma 2B Q4_0 | 8192 | pp16384 | 2201.83 | 2421.21 | 1.10 |
| MI60 / MI50 | gemma 2B Q4_0 | 16384 | pp16384 | 1982.16 | 2153.51 | 1.09 |
| MI60 / MI50 | llama 1B Q4_0 | 16 | pp16384 | 997.74 | 1063.29 | 1.07 |
| MI60 / MI50 | llama 1B Q4_0 | 32 | pp16384 | 1330.00 | 1330.98 | 1.00 |
| MI60 / MI50 | llama 1B Q4_0 | 64 | pp16384 | 1503.46 | 1581.35 | 1.05 |
| MI60 / MI50 | llama 1B Q4_0 | 128 | pp16384 | 2132.19 | 2290.14 | 1.07 |
| MI60 / MI50 | llama 1B Q4_0 | 256 | pp16384 | 2611.92 | 2828.35 | 1.08 |
| MI60 / MI50 | llama 1B Q4_0 | 512 | pp16384 | 2901.14 | 3190.54 | 1.10 |
| MI60 / MI50 | llama 1B Q4_0 | 1024 | pp16384 | 3039.48 | 3322.31 | 1.09 |
| MI60 / MI50 | llama 1B Q4_0 | 2048 | pp16384 | 3098.75 | 3421.87 | 1.10 |
| MI60 / MI50 | llama 1B Q4_0 | 4096 | pp16384 | 3055.54 | 3380.00 | 1.11 |
| MI60 / MI50 | llama 1B Q4_0 | 8192 | pp16384 | 2892.59 | 3216.54 | 1.11 |
| MI60 / MI50 | llama 1B Q4_0 | 16384 | pp16384 | 2479.57 | 2711.34 | 1.09 |
| MI60 / MI50 | llama 8B Q4_0 | 16 | pp16384 | 293.87 | 309.78 | 1.05 |
| MI60 / MI50 | llama 8B Q4_0 | 32 | pp16384 | 367.71 | 397.84 | 1.08 |
| MI60 / MI50 | llama 8B Q4_0 | 64 | pp16384 | 402.38 | 432.39 | 1.07 |
| MI60 / MI50 | llama 8B Q4_0 | 128 | pp16384 | 499.53 | 552.83 | 1.11 |
| MI60 / MI50 | llama 8B Q4_0 | 256 | pp16384 | 556.47 | 627.75 | 1.13 |
| MI60 / MI50 | llama 8B Q4_0 | 512 | pp16384 | 601.43 | 687.65 | 1.14 |
| MI60 / MI50 | llama 8B Q4_0 | 1024 | pp16384 | 547.37 | 703.20 | 1.28 |
| MI60 / MI50 | llama 8B Q4_0 | 2048 | pp16384 | 461.60 | 707.96 | 1.53 |
| MI60 / MI50 | llama 8B Q4_0 | 4096 | pp16384 | 421.46 | 707.79 | 1.68 |
| MI60 / MI50 | llama 8B Q4_0 | 8192 | pp16384 | 407.40 | 702.20 | 1.72 |
| MI60 / MI50 | llama 8B Q4_0 | 16384 | pp16384 | 390.19 | 682.75 | 1.75 |
| RX 6800 | gemma 2B Q4_0 | 16 | pp16384 | 637.07 | 658.96 | 1.03 |
| RX 6800 | gemma 2B Q4_0 | 32 | pp16384 | 993.77 | 1005.25 | 1.01 |
| RX 6800 | gemma 2B Q4_0 | 64 | pp16384 | 1265.65 | 1281.32 | 1.01 |
| RX 6800 | gemma 2B Q4_0 | 128 | pp16384 | 1516.82 | 1540.82 | 1.02 |
| RX 6800 | gemma 2B Q4_0 | 256 | pp16384 | 1726.39 | 1752.65 | 1.02 |
| RX 6800 | gemma 2B Q4_0 | 512 | pp16384 | 1900.37 | 1927.77 | 1.01 |
| RX 6800 | gemma 2B Q4_0 | 1024 | pp16384 | 1962.85 | 1985.40 | 1.01 |
| RX 6800 | gemma 2B Q4_0 | 2048 | pp16384 | 2007.33 | 2030.19 | 1.01 |
| RX 6800 | gemma 2B Q4_0 | 4096 | pp16384 | 2026.98 | 2051.96 | 1.01 |
| RX 6800 | gemma 2B Q4_0 | 8192 | pp16384 | 1979.52 | 2000.75 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 16 | pp16384 | 903.33 | 943.03 | 1.04 |
| RX 6800 | llama 1B Q4_0 | 32 | pp16384 | 1338.84 | 1315.65 | 0.98 |
| RX 6800 | llama 1B Q4_0 | 64 | pp16384 | 1668.50 | 1707.08 | 1.02 |
| RX 6800 | llama 1B Q4_0 | 128 | pp16384 | 1976.71 | 2049.74 | 1.04 |
| RX 6800 | llama 1B Q4_0 | 256 | pp16384 | 2197.42 | 2369.02 | 1.08 |
| RX 6800 | llama 1B Q4_0 | 512 | pp16384 | 2305.52 | 2511.15 | 1.09 |
| RX 6800 | llama 1B Q4_0 | 1024 | pp16384 | 2442.99 | 2606.40 | 1.07 |
| RX 6800 | llama 1B Q4_0 | 2048 | pp16384 | 2475.44 | 2629.51 | 1.06 |
| RX 6800 | llama 1B Q4_0 | 4096 | pp16384 | 2469.49 | 2637.45 | 1.07 |
| RX 6800 | llama 1B Q4_0 | 8192 | pp16384 | 2370.86 | 2493.89 | 1.05 |
| RX 6800 | llama 1B Q4_0 | 16384 | pp16384 | 2076.30 | 2176.14 | 1.05 |
| RX 6800 | llama 8B Q4_0 | 16 | pp16384 | 234.63 | 243.61 | 1.04 |
| RX 6800 | llama 8B Q4_0 | 32 | pp16384 | 328.40 | 351.80 | 1.07 |
| RX 6800 | llama 8B Q4_0 | 64 | pp16384 | 385.34 | 431.51 | 1.12 |
| RX 6800 | llama 8B Q4_0 | 128 | pp16384 | 462.65 | 514.19 | 1.11 |
| RX 6800 | llama 8B Q4_0 | 256 | pp16384 | 509.90 | 580.30 | 1.14 |
| RX 6800 | llama 8B Q4_0 | 512 | pp16384 | 532.11 | 613.16 | 1.15 |
| RX 6800 | llama 8B Q4_0 | 1024 | pp16384 | 536.30 | 622.59 | 1.16 |
| RX 6800 | llama 8B Q4_0 | 2048 | pp16384 | 526.41 | 636.03 | 1.21 |
| RX 6800 | llama 8B Q4_0 | 4096 | pp16384 | 520.42 | 637.78 | 1.23 |
| RX 6800 | llama 8B Q4_0 | 8192 | pp16384 | 514.03 | 632.39 | 1.23 |
| P40 | gemma 2B Q4_0 | 16 | pp16384 | 797.54 | 834.28 | 1.05 |
| P40 | gemma 2B Q4_0 | 32 | pp16384 | 1134.80 | 1179.20 | 1.04 |
| P40 | gemma 2B Q4_0 | 64 | pp16384 | 1348.98 | 1421.51 | 1.05 |
| P40 | gemma 2B Q4_0 | 128 | pp16384 | 1469.82 | 1565.21 | 1.06 |
| P40 | gemma 2B Q4_0 | 256 | pp16384 | 1555.06 | 1669.10 | 1.07 |
| P40 | gemma 2B Q4_0 | 512 | pp16384 | 1600.38 | 1716.29 | 1.07 |
| P40 | gemma 2B Q4_0 | 1024 | pp16384 | 1663.95 | 1792.98 | 1.08 |
| P40 | gemma 2B Q4_0 | 2048 | pp16384 | 1663.02 | 1832.83 | 1.10 |
| P40 | gemma 2B Q4_0 | 4096 | pp16384 | 1671.74 | 1834.08 | 1.10 |
| P40 | gemma 2B Q4_0 | 8192 | pp16384 | 1632.67 | 1793.12 | 1.10 |
| P40 | gemma 2B Q4_0 | 16384 | pp16384 | 1499.38 | 1637.84 | 1.09 |
| P40 | llama 1B Q4_0 | 16 | pp16384 | 1219.33 | 1211.62 | 0.99 |
| P40 | llama 1B Q4_0 | 32 | pp16384 | 1712.42 | 1746.36 | 1.02 |
| P40 | llama 1B Q4_0 | 64 | pp16384 | 2017.76 | 2056.77 | 1.02 |
| P40 | llama 1B Q4_0 | 128 | pp16384 | 2230.04 | 2277.90 | 1.02 |
| P40 | llama 1B Q4_0 | 256 | pp16384 | 2434.15 | 2490.01 | 1.02 |
| P40 | llama 1B Q4_0 | 512 | pp16384 | 2495.92 | 2550.39 | 1.02 |
| P40 | llama 1B Q4_0 | 1024 | pp16384 | 2572.93 | 2660.26 | 1.03 |
| P40 | llama 1B Q4_0 | 2048 | pp16384 | 2622.93 | 2689.04 | 1.03 |
| P40 | llama 1B Q4_0 | 4096 | pp16384 | 2614.92 | 2676.44 | 1.02 |
| P40 | llama 1B Q4_0 | 8192 | pp16384 | 2528.01 | 2584.50 | 1.02 |
| P40 | llama 1B Q4_0 | 16384 | pp16384 | 2224.16 | 2272.10 | 1.02 |
| P40 | llama 8B Q4_0 | 16 | pp16384 | 295.74 | 299.10 | 1.01 |
| P40 | llama 8B Q4_0 | 32 | pp16384 | 357.08 | 363.73 | 1.02 |
| P40 | llama 8B Q4_0 | 64 | pp16384 | 423.12 | 432.09 | 1.02 |
| P40 | llama 8B Q4_0 | 128 | pp16384 | 458.12 | 466.13 | 1.02 |
| P40 | llama 8B Q4_0 | 256 | pp16384 | 490.97 | 498.69 | 1.02 |
| P40 | llama 8B Q4_0 | 512 | pp16384 | 501.94 | 510.59 | 1.02 |
| P40 | llama 8B Q4_0 | 1024 | pp16384 | 513.06 | 524.31 | 1.02 |
| P40 | llama 8B Q4_0 | 2048 | pp16384 | 519.36 | 530.47 | 1.02 |
| P40 | llama 8B Q4_0 | 4096 | pp16384 | 517.18 | 527.73 | 1.02 |
| P40 | llama 8B Q4_0 | 8192 | pp16384 | 514.79 | 524.44 | 1.02 |
| P40 | llama 8B Q4_0 | 16384 | pp16384 | 502.04 | 511.36 | 1.02 |

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Sep 14, 2025
@IMbackK
Collaborator

IMbackK commented Sep 15, 2025

I'll take a proper look at this, but will not be able to do so until the 17th.

@JohannesGaessler
Collaborator Author

Since you're already here, do you have an opinion on whether the HIP backend should be compiled with -ffast-math?

@IMbackK
Collaborator

IMbackK commented Sep 17, 2025

A quick grep suggests we use inf directly (see softmax), so blanket -ffast-math is out. We could use some of the individual fast-math flags, or apply -ffast-math on a per-function or per-translation-unit basis, but I'm not sure it's worth it. In the past, on other code, LLVM fast-math has made things slower on HIP for some reason, and before ROCm 6.1 there were bugs I encountered where fast-math simply generated wrong code.
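(For illustration, a hedged sketch of the masking pattern in question, not the actual softmax code: it relies on IEEE infinity semantics, which -ffast-math discards via -ffinite-math-only.)

```cuda
#include <math.h>

// Masked-out positions get a logit of -INFINITY so that expf(-INFINITY) == 0.0f
// and they drop out of the softmax sum. Under -ffinite-math-only the compiler
// may assume infinities never occur and optimize this in ways that break the mask.
static __device__ float masked_exp(const float logit, const bool masked, const float max_logit) {
    const float x = masked ? -INFINITY : logit;
    return expf(x - max_logit); // 0.0f for masked positions, assuming IEEE inf handling
}
```

Restricting fast-math to individual translation units would avoid code like this, but as noted above it is unclear whether the gain justifies the complexity.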

Collaborator

@IMbackK IMbackK left a comment


Changes look fine to me; I can also confirm the performance delta on gfx1030. When making gfx908 use this code path I can't reproduce the same magnitude of performance improvement as @JohannesGaessler does on gfx906, but I find no regression. I noticed that this PR reduces the amount of spilled VGPRs (although some instances still spill, like _ZL15flash_attn_tileILi64ELi32ELb0EEvPKcS1_S1_S1_S1_PKiPfP15HIP_vector_typeIfLj2EEffffjfiiiiiiiiiiiiiliiliiiiil), so it's possible that some of the extra improvement on gfx906 comes from reduced spills to scratch, whereas gfx908 can spill to AGPRs, which has a lower performance impact.

Side note:
Some of the vector fattn kernels spill to high heaven:

Function Name: _ZL22flash_attn_vec_ext_f32ILi128ELi8EL9ggml_type8ELS0_8ELb1EEvPKcS2_S2_S2_S2_PKiPfP15HIP_vector_typeIfLj2EEffffjfiiiiiiiiiiiiiliiliiiiil
     TotalSGPRs: 88
     VGPRs: 63
     AGPRs: 64
     ScratchSize [bytes/lane]: 1100
     Dynamic Stack: False
     Occupancy [waves/SIMD]: 4
     SGPRs Spill: 0
     VGPRs Spill: 298
     LDS Size [bytes/block]: 10240

That's 362 vector registers spilled in this kernel, since the AGPRs are also spills in this case.
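(As an aside, a hedged sketch of the usual knob for trading register pressure against occupancy; the kernel name and numbers are placeholders, not what the vec kernels actually use.)

```cuda
// __launch_bounds__(max_threads_per_block, min_blocks_per_multiprocessor) caps
// the per-thread register budget: requesting more resident blocks per SM/CU
// forces the compiler to use fewer registers, which can turn register pressure
// into spills to scratch (or, on gfx908+, into AGPRs), which is the trade-off
// visible in the resource-usage report above.
__global__ void __launch_bounds__(128, 2)
flash_attn_vec_sketch(const char * Q, const char * K, const char * V, float * dst) {
    // kernel body omitted; placeholder only
}
```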

@JohannesGaessler JohannesGaessler merged commit c959b67 into ggml-org:master Sep 17, 2025
46 of 48 checks passed
@JohannesGaessler
Collaborator Author

One of my current efforts is to make the kernel parameters more configurable as a function of hardware. I intend to soon procure an RDNA4 GPU so that I can implement support for the AMD WMMA instructions in the mma FA kernel. In principle, if the mma kernel can be made to work, it should perform best, since it needs to hold fewer registers than the tile kernel and, unlike the WMMA kernel, does not have to go through shared memory. Can you give me a list of the AMD hardware that you have so that I can adjust my purchases for wider coverage?

@IMbackK
Collaborator

IMbackK commented Sep 17, 2025

Sure, I have gfx803 (Fiji / GCN3), gfx900 (Vega APU / GCN5), gfx906 (MI50 / GCN5.1), gfx908 (MI100 / CDNA1), and gfx1030 (RX6800XT / RDNA2).

I don't have any WMMA device at all, so any device with WMMA instructions would be very helpful. I know you don't intend to buy anything for actual use, but from a practical perspective the large-register-file RDNA3 GPUs (7900xtx, 7900xt, 7800xt) tend to be better for AI inference than RDNA4, simply on account of being bigger devices with more CUs, VRAM, and bandwidth.

@broadbit-hu

@IMbackK I have 7800xt, 7900xt, and 7900xtx cards; how can I help you?

@IMbackK
Collaborator

IMbackK commented Sep 17, 2025

@broadbit-hu Not at the moment. For regression testing it is useful to have people around who regularly run llama.cpp on a given arch, but we were talking about doing feature development. When doing feature development, the dev in question really needs to have the device with the instructions to be implemented on hand in one of their machines.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 19, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 23, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 24, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 25, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 25, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 26, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 27, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 29, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 30, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 30, 2025
)"

This reverts commit 75a3a6c.

d

Update cudart64_12.dll

Revert "Cudart 12.9"

This reverts commit f79c687.

Revert "Allow compile exe, pdf features off"

This reverts commit 5e1c154.

Update fattn.cu

Update set-rows.cu

batches

Revert "try fix fattn again, porting some older code. the cc detection is not working well, so its hacky"

This reverts commit 7b04191.

Update ggml-cuda.cu

Update fattn.cu

Update fattn.cu

Update fattn.cu

Add option to disable MMA support on Turing

Author : pt13762104

GGML_CUDA_NO_PEER_COPY to try to fix a crash on Gemma 3

Deactivate SWA when Fast Forwarding, commented

Wrench Fix for the SWA I borked

Clean-up quantkv algo

comment warp sizes for now in IQ_K MMQ Kernels

KV 24 -> KV 31

Add a readme.

ngxson's commented hack

Try some hack for gpt-oss

Update llama-vocab.cpp

Bump Windows max open files from 512 to 2048

Author : Thireus

CLI - Specify GGML_TYPE to quantize for the main tensors. (#91)

To complement the token_embd.weight and output.weight :

attn_v.weight
attn_k.weight.
attn_q_weight
attn_output.weight
attn_qkv.weight
ffn_gate
ffn_down
ffn_up

EsoCroK naming

v1.99430_b6645-6_Q6-IO2346_RMv1.17.99m

Disable I2_K cpu quantization.

To allow compilation.

MMQ code adaptation

Update mmq.cuh

MMQ Initial code for IQ2,3,4,5,6_K

IQ_K quants first gen (4, 5, 6)

Some logs back

Batches

Croco Bench.

Double the anti-abuse limits

Allow compile exe, pdf features off

Revert "Allow compile exe, pdf features off"

This reverts commit 5e2451f129f0bca326f74aae24df475c0410cdbf.

Update koboldcpp.py

Revert "Allow compile exe, pdf features off"

This reverts commit 2a7e9e004e8578a05fb67967d09cf36263867b9b.

Revert "Allow compile exe, pdf features off"

This reverts commit b4fd7809a4f77ff18bd415fcfb2d5f435e3b63a3.

quantization tweaks

iq3_ks quantization tweaks

Minor iq3_k tweak

q2_K tweaks

q3_K tweaks

q4_K tweaks

q5_K tweaks

GGUF v14 attempt of second fix.

loosen gguf restrictions.

Quantization improvements #295 and #302, GGML part only

Improved IQ2_XS quantization #312

Improved IQ1_M quantization #327

ggml_row_size accounting fix for GGUF v14

Credits : @ikawrakow

Fighting with cmake #279

Drop the GGML count limitation limit

Old markings

Customize KCPP.py

Croco additional chat adapters andtemplates

Reinstate "skip barrier of noop"

Allow q8_0 KV cache for head size 256 #330

Up FA KV modes

256 candidates (1024 with Grammar)

Adapt q6_0 MMQ to llama.cpp mainline

Q6_0 MMQ Kernel attempt

MMQ for Q6_0 authored by Ikawrakow

Add Q6_0 MMQ to template generator authored by Ikawrakow

Q6_0 KVQ for KCPP/Croco -> KV22

For release.

fix a few lazy-cuts and hiccups left during the merge of IQ4_NL.

dequantize for q6_0 and related cpy

Enable q6_0 for flash attention

As with IQ4_NL, just for head size of 128 for now. Without GGML_CUDA_FA_ALL_QUANTS set, only Q6_0 + Q5_0 and Q8_0 + Q6_0 are included. With this the VRAM poor have better options for selecting the best possible (as allowed by VRAM, model size, context length) quantized KV-cache.

PR by Ikawrakow on ik_llama.cpp

Adding Q6_0 (#77) Rev 20240807

* Adding q6_0 - basics + AVX2/Zen4 working

* Adding q6_0: CUDA dequantize works, but not mmvq

* Adding q6_0: CUDA mmvq works

* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache

* Add q6_0 to CPU flash attention

Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache
gives about the same PPL as q8_0 K-cache and q4_0 V-cache,
while needing the exact same RAM.
I.e., what was the point?

* q6_0: slightly better kv-cache result

Better than q8_0+q4_0, but not as good as q8_0+iq4_nl

* q6_0: works on ARM_NEON

* q6_0: dequantize works on Metal, but not vector dot product

* q6_0: it now works on Metal

Outperforms q5_0 by a significant margin. E.g.
| model                          |       size |     params | backend    | ngl | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     44.02 ± 0.08 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     40.13 ± 0.12 |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    500.55 ± 0.32 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    448.02 ± 0.27 |

* q6_0: can now be used for kv-cache on Metal -> skipped.

---------

Adaptation to mainline by me!

IQ4_NL KVQ for KCPP/Croco

missing templates instances for KVQ IQ4_NL
Update fattn.cu for KVQ IQ4_NL
Update fattn-vec-f16.cuh for KVQ IQ4_NL
Update fattn-vec-f32.cuh for KVQ IQ4_NL
CML and Makefile FOR IQ4_NL

KV_IQ4_NL uncommenting VEC16 cases
KV_IQ4_NL uncommenting VEC32 cases

Enable IQ4_NL for V-cache in token generation

Add IQ4_NL + IQ4_NL to FA

This is a better alternative than Q4_0 + Q4_0 for the VRAM poor.

Comment unwanted add-in in makefile

iq4_nl: faster quantization (#76)

CUDA: faster float -> iq4_nl conversion (#73)

* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2

PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s up
from 133.2 t/s.

Default Blas Batch Size = 128

Quant KV and Draft QKV, 24 modes

With customizable QKV for the draft as well.
And reduced Blas Batch Size for the draft model.

Default Draft Amount = 4

Bench context size

Max contextsize and steps

Croco CML

SCHED_MAX_COPIES = 1

And Croco usual additions to the CMakeList

Cudart 12.9

Revert "CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769)"

This reverts commit 79bc429.

Revert "HIP: use v_dot2_f32_f16 instruction for FA (ggml-org#15884)"

This reverts commit 17bc5a8.

Revert "CUDA: larger SRAM reads for tile FA, AMD FP16 dot (ggml-org#15927)"

This reverts commit 0e6ff00.

Revert "CUDA: fix FA occupancy, optimize tile kernel (ggml-org#15982)"

This reverts commit c959b67.

Revert "CUDA: fix compilation on CC 6.0 (ggml-org#16091)"

This reverts commit 368560a.

Co-Authored-By: Kawrakow <[email protected]>
Co-Authored-By: Iwan Kawrakow <[email protected]>
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 1, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 2, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 2, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 3, 2025