@JaxChen29
Contributor

Optimizations to the embedding forward pass on ROCm:

  1. Use vec4 instead of vec2 in the embedding VBE forward kernel.
  2. Use preloading to optimize the VBE forward kernel.
  3. Because a ROCm wavefront has 64 threads, optimize subwarp handling in the embedding forward v2 kernel when the embedding dim is between 32 and 64.
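
To illustrate why the wavefront width matters for item 3, here is a minimal host-side sketch of the lane arithmetic. It is not taken from the kernels in this PR: the helper names (`next_pow2`, `subwarp_lanes`, `rows_per_warp`) are hypothetical, and it assumes each lane loads `vec` contiguous elements and that the subwarp size is rounded up to a power of two.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not FBGEMM APIs.

// Smallest power of two that is >= x (for x >= 1).
constexpr uint32_t next_pow2(uint32_t x) {
  uint32_t p = 1;
  while (p < x) p <<= 1;
  return p;
}

// Lanes needed to cover one embedding row of `dim` elements when each
// lane loads `vec` elements (assumption: rounded up to a power of two).
constexpr uint32_t subwarp_lanes(uint32_t dim, uint32_t vec) {
  return next_pow2((dim + vec - 1) / vec);
}

// Rows that one warp/wavefront of `warp_size` threads can process at once
// if it is split into subwarps of `subwarp_lanes(dim, vec)` lanes each.
constexpr uint32_t rows_per_warp(uint32_t dim, uint32_t vec,
                                 uint32_t warp_size) {
  const uint32_t lanes = subwarp_lanes(dim, vec);
  return lanes >= warp_size ? 1 : warp_size / lanes;
}
```

Under these assumptions, a row with embedding dim 64 and 4-element loads needs 16 lanes, so a 64-thread wavefront can hold 4 such rows where a 32-thread warp holds 2. Subwarp sizing tuned for 32-thread warps would therefore leave part of a 64-thread wavefront idle.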

@meta-cla meta-cla bot added the cla signed label Nov 12, 2025
@JaxChen29 JaxChen29 marked this pull request as draft November 12, 2025 13:13
@JaxChen29 JaxChen29 marked this pull request as ready for review November 12, 2025 14:53
2 participants