Replace CPU Function Calls with GPU Kernel Invocations #697
Conversation
Overall looks good! Left some comments. As discussed offline, it'd be good to test each possible codepath and see if the behavior is expected, and measure performance for the canonical codepath. Thanks!
…for including sampling latency breakdown in response
CharlieFRuan left a comment
This is great, thank you so much! Only two minor nits. Afterwards let's merge!
examples/get-started-latency-breakdown/src/get_started_latency_breakdown.ts
CharlieFRuan left a comment
LGTM, great work! Let's follow up with on-GPU sampling!
…-ai#697)

1. Replace CPU function calls for the following tasks with GPU kernel invocations:
   - Apply logit bias
   - Apply penalties to logits
   - Compute softmax with temperature (sampling will be replaced in a future PR)
2. Fixed a bug where the repetition penalty was not used in the generation config:
   - Added the repetition penalty to the CompletionCreateParamsBase and ChatCompletionRequestBase interfaces
   - Updated the definition in GenerationConfig and added a reference in engine.ts
3. Added a field to the CompletionCreateParamsBase and ChatCompletionRequestBase interfaces to enable logging the time taken for individual steps
4. Added sanity checks for individual steps in sampleTokenFromLogits

Performance Comparison: Compared performance for "canonical" flows, averaged across 20 runs:
- No logit_bias
- No logitProcessor
- Applied penalties
- With and without logprobs

1. Before PR (without logprobs): ~0.064 s per output token (~15.63 decode tokens/s)
2. After PR (without logprobs): ~0.066 s per output token (~15.15 decode tokens/s)
3. Before PR (with logprobs): ~0.052 s per output token (~19.23 decode tokens/s)
4. After PR (with logprobs): ~0.048 s per output token (~20.83 decode tokens/s)

Additional Notes:
- Need to profile the performance of sampleTopPFromLogits vs. sampleTopPFromProb on the CPU to determine why performance with logprobs is better
- Applying logit_bias is much faster on the GPU than on the CPU
- Overheads outside of sampleTokenFromLogits make the overall improvement less pronounced (the total time spent in sampleTokenFromLogits is ~0.0117 s before the PR and ~0.0076 s after)
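For reference, the three logit post-processing steps moved onto the GPU can be sketched on the CPU as below. This is a minimal illustrative sketch, not web-llm's actual implementation: the function names, the Map/Set parameter types, and the standard repetition-penalty formulation are assumptions for illustration.

```typescript
// CPU reference sketch of the steps the PR replaces with GPU kernels.
// All names here are hypothetical, not web-llm's API.

// Step 1: add a per-token bias to selected logits.
function applyLogitBias(logits: Float32Array, logitBias: Map<number, number>): void {
  for (const [tokenId, bias] of logitBias) {
    logits[tokenId] += bias;
  }
}

// Step 2: penalize tokens already generated (standard repetition-penalty
// formulation: shrink positive logits, amplify negative ones; penalty > 1
// discourages repeats).
function applyRepetitionPenalty(
  logits: Float32Array,
  seenTokenIds: Set<number>,
  penalty: number,
): void {
  for (const id of seenTokenIds) {
    logits[id] = logits[id] > 0 ? logits[id] / penalty : logits[id] * penalty;
  }
}

// Step 3: softmax over temperature-scaled logits, subtracting the max
// for numerical stability.
function softmaxWithTemperature(logits: Float32Array, temperature: number): Float32Array {
  const scaled = logits.map((x) => x / temperature);
  const max = Math.max(...scaled);
  const probs = new Float32Array(scaled.length);
  let sum = 0;
  for (let i = 0; i < scaled.length; i++) {
    probs[i] = Math.exp(scaled[i] - max);
    sum += probs[i];
  }
  for (let i = 0; i < probs.length; i++) {
    probs[i] /= sum;
  }
  return probs;
}
```

Each step is an elementwise pass over the vocabulary-sized logits vector, which is why they map naturally onto GPU kernels and avoid a GPU-to-CPU copy per decoded token.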
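The per-step latency breakdown discussed above could be collected with a small timing wrapper like the following sketch. The `timeStep` helper and the step labels are hypothetical, not the fields actually added in this PR.

```typescript
// Hypothetical helper for accumulating per-step latencies across a decode
// loop, in the spirit of the PR's sampling latency breakdown.
function timeStep<T>(
  label: string,
  breakdown: Record<string, number>,
  fn: () => T,
): T {
  const start = performance.now();
  const result = fn();
  // Accumulate elapsed time in seconds under this step's label.
  breakdown[label] = (breakdown[label] ?? 0) + (performance.now() - start) / 1000;
  return result;
}

// Usage: wrap each sampling step, then divide totals by the number of
// output tokens to get per-token figures like those quoted above.
const breakdown: Record<string, number> = {};
const logits = timeStep("applyPenalties", breakdown, () => new Float32Array(8));
const probs = timeStep("softmaxWithTemperature", breakdown, () =>
  logits.map((x) => Math.exp(x)),
);
```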