[Kernels] Migrate sampling to WebGPU #737

akaashrp · 2025-11-02T05:57:42Z

Performance comparison with v0.2.79: decode throughput for the "canonical" flows, averaged across 20 runs.

v0.2.79 performance: ~38.17 decode tokens/s
Post-PR performance: ~38.99 decode tokens/s

Notes:

The improvement is small (about 2%), likely because of kernel launch overhead: sampling now requires three kernel calls (fsoftmaxWithTemperature, fargsortProbs, fSampleWithTopP).
The WebGPU sampling path will likely scale better when sampling from multiple sequences simultaneously.
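
For reviewers who want a reference for what the three stages compute, here is a minimal CPU sketch in TypeScript. It is not the WebGPU kernel code from this PR: the function names (`softmaxWithTemperature`, `argsortDescending`, `sampleTopP`, `sampleToken`) are hypothetical, and it assumes a non-empty logits array, `temperature > 0`, and a caller-supplied uniform random number `u` in [0, 1) so results are reproducible.

```typescript
// CPU reference sketch of the three sampling stages (illustrative only;
// these are NOT the WebGPU kernels added in this PR).

// Stage 1: softmax over logits scaled by 1/temperature (assumes temperature > 0).
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const scaled = logits.map((x) => x / temperature);
  const maxVal = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((x) => Math.exp(x - maxVal));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Stage 2: token indices sorted by descending probability.
function argsortDescending(probs: number[]): number[] {
  return probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
}

// Stage 3: top-p (nucleus) sampling over the sorted tokens.
// `u` is a uniform random number in [0, 1), passed in by the caller.
function sampleTopP(probs: number[], sortedIdx: number[], topP: number, u: number): number {
  // Find the smallest prefix of sorted tokens whose cumulative mass reaches topP.
  let mass = 0;
  let cutoff = 0;
  while (cutoff < sortedIdx.length) {
    mass += probs[sortedIdx[cutoff]];
    cutoff++;
    if (mass >= topP) break;
  }
  // Sample within that prefix, scaled by the prefix's total mass.
  const target = u * mass;
  let cumulative = 0;
  for (let i = 0; i < cutoff; i++) {
    cumulative += probs[sortedIdx[i]];
    if (target < cumulative) return sortedIdx[i];
  }
  return sortedIdx[cutoff - 1]; // guard against floating-point rounding
}

// Runs the three stages in sequence for a single sequence.
function sampleToken(logits: number[], temperature: number, topP: number, u: number): number {
  const probs = softmaxWithTemperature(logits, temperature);
  const order = argsortDescending(probs);
  return sampleTopP(probs, order, topP, u);
}
```

On the GPU each stage is a separate kernel launch, which is where the per-token overhead noted above would come from. In this sketch, `sampleToken([1, 3, 2], 1.0, 0.5, 0.3)` keeps only the most likely token in the nucleus and returns its index.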

[Kernels] Migrate sampling to WebGPU

ee5a212

akaashrp requested a review from CharlieFRuan November 2, 2025 05:57
