@akaashrp akaashrp commented Nov 2, 2025

Performance comparison with v0.2.79: decode throughput for the "canonical" flow, averaged across 20 runs, with the following settings:

  • No logit_bias
  • No logitProcessor
  • Frequency, presence, and repetition penalties applied
  • logprobs enabled
  • No top_logprobs

v0.2.79 performance: ~38.17 decode tokens/s
Post-PR performance: ~38.99 decode tokens/s
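
For scale, the relative change implied by the two figures above can be worked out directly. This is only arithmetic on the reported averages; the variable and function names are illustrative and not part of the PR.

```python
# Relative change in decode throughput, using the two averages reported above.
baseline_tps = 38.17  # v0.2.79, decode tokens/s
post_pr_tps = 38.99   # post-PR, decode tokens/s


def relative_change(before: float, after: float) -> float:
    """Return the fractional change going from `before` to `after`."""
    return (after - before) / before


change = relative_change(baseline_tps, post_pr_tps)
print(f"{change:.1%}")  # prints "2.1%"
```

That is roughly a 2% gain, which is consistent with describing the improvement as minimal.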

Notes:

  1. The improvement is small, most likely because of kernel launch overhead: sampling requires three separate kernel calls (fsoftmaxWithTemperature, fargsortProbs, fSampleWithTopP).
  2. This approach will likely scale better when sampling from multiple sequences simultaneously, since the same launches could then serve several sequences at once.
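
To make the three stages in note 1 concrete, here is a minimal CPU reference sketch in plain Python. It is not the PR's kernel code: the function names only mirror the kernel names mentioned above, and the signatures, defaults, and the inverse-CDF sampling strategy are assumptions made for illustration. In the path described in the note, each stage would correspond to a separate kernel launch.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Stage 1: temperature-scaled softmax, stabilized by subtracting the max."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argsort_probs(probs):
    """Stage 2: token indices ordered by descending probability."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def sample_with_top_p(probs, order, top_p, u):
    """Stage 3: sample from the smallest sorted prefix whose mass reaches top_p.

    `u` is a uniform random number in [0, 1).
    """
    nucleus = []
    mass = 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Inverse-CDF sampling over the nucleus, scaled by its total mass.
    threshold = u * mass
    running = 0.0
    for i in nucleus:
        running += probs[i]
        if running > threshold:
            return i
    return nucleus[-1]  # guard against floating-point rounding


def sample_token(logits, temperature=1.0, top_p=0.95, rng=random):
    """Run the three stages in sequence for a single sequence's logits."""
    probs = softmax_with_temperature(logits, temperature)
    order = argsort_probs(probs)
    return sample_with_top_p(probs, order, top_p, rng.random())
```

For example, `sample_token([0.0, 5.0, 1.0], top_p=0.5)` always returns index 1, because that token alone already holds more than half of the probability mass. With a single sequence, each of the three calls does little work, which fits the note's point that fixed per-launch cost can dominate.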

@akaashrp akaashrp requested a review from CharlieFRuan November 2, 2025 05:57