Conversation

@akaashrp (Contributor) commented Jun 7, 2025

  1. Replace CPU function calls for the following tasks with GPU kernel invocations (see the sketches after this list):
  • Apply logit bias
  • Apply penalties to logits
  • Compute softmax with temperature (sampling itself will be moved to GPU in a future PR)
  2. Fix a bug where the repetition penalty was not being applied from the generation config:
  • Added repetition_penalty to the CompletionCreateParamsBase and ChatCompletionRequestBase interfaces
  • Updated the definition in GenerationConfig and added a reference in engine.ts
  3. Add a field to the CompletionCreateParamsBase and ChatCompletionRequestBase interfaces to enable logging the time taken by individual steps
  4. Add sanity checks for the individual steps in sampleTokenFromLogits
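
To make item 1 concrete, here is a minimal CPU-side sketch of the three steps that move to GPU kernels. The function names, signatures, and penalty formulas below are illustrative (standard OpenAI-style semantics), not the actual web-llm kernels; a plain reference like this is also the sort of thing the sanity checks in item 4 could compare GPU output against.

```typescript
// Illustrative CPU reference for the three steps moved to GPU kernels.
// Names and shapes are assumptions for this sketch, not web-llm's API.

// Apply per-token logit bias (OpenAI-style: tokenId -> additive bias).
function applyLogitBias(
  logits: Float32Array,
  logitBias: Record<number, number>,
): void {
  for (const [tokenId, bias] of Object.entries(logitBias)) {
    logits[Number(tokenId)] += bias;
  }
}

// Apply frequency/presence/repetition penalties to previously seen tokens.
function applyPenalties(
  logits: Float32Array,
  tokenCounts: Map<number, number>, // tokenId -> occurrences so far
  frequencyPenalty: number,
  presencePenalty: number,
  repetitionPenalty: number,
): void {
  for (const [tokenId, count] of tokenCounts) {
    logits[tokenId] -= count * frequencyPenalty + presencePenalty;
    // Common repetition-penalty rule: shrink positive logits, push
    // negative logits further down.
    logits[tokenId] =
      logits[tokenId] > 0
        ? logits[tokenId] / repetitionPenalty
        : logits[tokenId] * repetitionPenalty;
  }
}

// Numerically stable softmax with temperature.
function softmaxWithTemperature(
  logits: Float32Array,
  temperature: number,
): Float32Array {
  const probs = new Float32Array(logits.length);
  let maxLogit = -Infinity;
  for (const x of logits) maxLogit = Math.max(maxLogit, x);
  let sum = 0;
  for (let i = 0; i < logits.length; i++) {
    probs[i] = Math.exp((logits[i] - maxLogit) / temperature);
    sum += probs[i];
  }
  for (let i = 0; i < probs.length; i++) probs[i] /= sum;
  return probs;
}
```

And a short usage sketch for the repetition-penalty fix in item 2, assuming the field is exposed as repetition_penalty on the request per the interface change above (the model id is just a placeholder; substitute any id from web-llm's prebuilt model list):

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Placeholder model id; use any id from web-llm's prebuilt list.
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Write a haiku about GPUs." }],
  repetition_penalty: 1.1, // values > 1 discourage repeating generated tokens
});
console.log(reply.choices[0].message.content);
```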

Performance Comparison: compared performance for "canonical" flows, averaged across 20 runs:

  • No logit_bias
  • No logitProcessor
  • Applied penalties
  • With and without logprobs
  1. Before PR performance (without logprobs): ~0.064s per output token (~15.63 decode tokens/s)
  2. After PR performance (without logprobs): ~0.066s per output token (~15.15 decode tokens/s)
  3. Before PR performance (with logprobs): ~0.052s per output token (~19.23 decode tokens/s)
  4. After PR performance (with logprobs): ~0.048s per output token (~20.83 decode tokens/s)
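
As a consistency check, the tokens/s figures are just the reciprocal of the per-token latency: 1 / 0.064 s ≈ 15.6 tokens/s, and 1 / 0.048 s ≈ 20.8 tokens/s.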

Additional Notes:

  • Need to profile performance of sampleTopPFromLogits vs sampleTopPFromProb on CPU to determine why performance with logprobs is better
  • Application of logit_bias is much faster on GPU than CPU
  • There are additional overheads outside of the sampleTokenFromLogits function that make the performance improvement less pronounced: the total time spent in sampleTokenFromLogits is ~0.0117s before the PR and ~0.0076s after
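
As a rough illustration of the per-step timing that the new logging field enables, here is a minimal sketch. The names timed and logStepTimes are hypothetical, not web-llm's actual API; note also that GPU kernel launches are asynchronous, so timing them this way measures only CPU-side dispatch unless the device is synchronized first.

```typescript
// Hedged sketch of per-step timing; `timed` and `logStepTimes` are
// hypothetical names, not part of web-llm's API.
function timed<T>(label: string, step: () => T, log: boolean): T {
  if (!log) return step();
  const start = performance.now();
  const result = step();
  console.log(`${label}: ${(performance.now() - start).toFixed(3)} ms`);
  return result;
}

// Example: wrap one stage of a sampleTokenFromLogits-style pipeline,
// reusing softmaxWithTemperature from the sketch above.
const logStepTimes = true;
const logits = Float32Array.from({ length: 32000 }, () => Math.random());
const probs = timed(
  "softmaxWithTemperature",
  () => softmaxWithTemperature(logits, 1.0),
  logStepTimes,
);
```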

@CharlieFRuan mentioned this pull request Jul 15, 2025
@CharlieFRuan (Member) left a comment
Overall looks good! Left some comments. As discussed offline, it'd be good to test each possible codepath and see if the behavior is expected, and measure performance for the canonical codepath. Thanks!

@CharlieFRuan (Member) left a comment

This is great, thank you so much! Only two minor nits. Afterwards let's merge!

@CharlieFRuan (Member) left a comment

LGTM, great work! Let's follow up with on-GPU sampling!

@CharlieFRuan CharlieFRuan merged commit e4b4dc2 into mlc-ai:main Sep 13, 2025
1 check passed
atebites-hub pushed a commit to atebites-hub/web-llm that referenced this pull request Oct 4, 2025