[None][fix] Use fp32 for indexer weight_proj GEMM #9243

chang-l · 2025-11-18T02:41:58Z

This PR depends on #9232 and updates indexer.weight_proj to use an FP32 GEMM, matching the reference implementation. Our initial perf/accuracy study shows that promoting weight_proj to FP32 can improve accuracy, but it also introduces an expected performance regression.

Accuracy experiments indicate that an FP32 weight_proj may stabilize top-k selection and improve model accuracy. However, enabling this means:

- weight_proj can no longer be fused on the nvfp4 path.
- weight_proj may decompose into three kernels (copy_to_fp32 → fp32_gemm → splitKreduce), which adds overhead.
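
For readers unfamiliar with the three-stage decomposition mentioned above, here is a minimal pure-Python sketch of a split-K matrix multiply. It is only an illustration of the idea, not the kernels that actually run: the function names are hypothetical, and plain Python floats stand in for FP32 tensors.

```python
def copy_to_fp32(matrix):
    # Stage 1: stand-in for the dtype-conversion kernel (Python floats here).
    return [[float(v) for v in row] for row in matrix]


def partial_gemm(a, b, k_start, k_end):
    # Stage 2: one split-K slice multiplies only over K[k_start:k_end].
    rows, cols = len(a), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(k_start, k_end)) for j in range(cols)]
        for i in range(rows)
    ]


def split_k_reduce(partials):
    # Stage 3: sum the partial outputs elementwise into the final result.
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


def split_k_matmul(a, b, num_splits):
    a32, b32 = copy_to_fp32(a), copy_to_fp32(b)
    k = len(b32)
    step = (k + num_splits - 1) // num_splits  # ceil(k / num_splits)
    partials = [partial_gemm(a32, b32, s, min(s + step, k)) for s in range(0, k, step)]
    return split_k_reduce(partials)


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 0]]
result = split_k_matmul(a, b, num_splits=2)
```

Each stage maps to a separate launch in the real path, which is where the extra overhead comes from compared with a single fused kernel.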

Accuracy Impact (based on #9232)

AIME25 pass@1
--------------------------------
With this PR:     85.83% +/- 3.19%
Without this PR:  83.33% +/- 2.72%
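
One plausible mechanism for the top-k sensitivity is that a lower-precision format collapses nearly equal scores into ties. The sketch below assumes bfloat16 as the lower-precision format and rounds only the final scores (a real GEMM also differs in accumulation), so treat it as an illustration rather than a reproduction of the indexer. All names and score values are hypothetical.

```python
import struct


def round_to_fp32(x):
    # Round a Python float (double) to the nearest float32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x):
    # Emulate bfloat16: keep the top 16 bits of the float32 encoding,
    # using round-to-nearest-even on the discarded lower 16 bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def top_k(scores, k):
    # Stable sort: ties keep the lower index first.
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


scores = [1.001, 1.003, 0.5]  # hypothetical indexer scores
fp32_scores = [round_to_fp32(s) for s in scores]
bf16_scores = [round_to_bf16(s) for s in scores]

top1_fp32 = top_k(fp32_scores, 1)
top1_bf16 = top_k(bf16_scores, 1)
```

In this toy case the FP32 scores keep index 1 ahead of index 0, while the bfloat16-rounded scores both become 1.0 and the selection falls back to tie-breaking order.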

Performance Impact (config: DSV3.2-NVFP4; ISL/OSL=8k/1k; DEP=8; MTP=1; concurrency=64)

Throughput (tps/gpu)   FP8      nvFP4
--------------------------------------
Before this PR         214.3    307.2
After this PR          223.01   297.2
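
As a quick sanity check on the throughput numbers above, the relative change per path can be computed directly from the table. The variable names are just for illustration; the figures are copied as given.

```python
# Throughput in tps/gpu, copied from the table above.
before = {"FP8": 214.3, "nvFP4": 307.2}
after = {"FP8": 223.01, "nvFP4": 297.2}

# Relative change in percent; positive means higher throughput after the PR.
change_pct = {
    path: round((after[path] - before[path]) / before[path] * 100, 2)
    for path in before
}

for path, pct in change_pct.items():
    print(f"{path}: {pct:+.2f}%")
```

With these figures the nvFP4 path drops by roughly 3.3%, while the FP8 number as reported is about 4.1% higher.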

Summary by CodeRabbit

Refactor
- Streamlined sparse attention backend weight computation with consolidated helper functions.
- Simplified attention configuration by unifying fused indexer flags.
- Optimized internal weight handling to use FP8 quantization paths.
- Reduced forward pass parameters by internalizing weight projection logic.


tensorrt_llm/_torch/models/modeling_deepseekv3.py

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

coderabbitai · 2025-11-18T22:17:57Z

📝 Walkthrough

The changes refactor the DSA indexer implementation to consolidate Q/K projection and preparation logic into internal helper methods, introduce FP32 float conversion, modify weight projection parameters, and remove indexer_weights from forward signatures. Related changes rename a fuse flag and remove indexer_weights outputs across dependent attention modules.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| DSA Indexer Refactoring `tensorrt_llm/_torch/attention_backend/sparse/dsa.py` | Added module-level `_to_float()` function for FP32 conversion. Introduced internal helper methods: `_weight_scale()`, `_qk_projection_and_rope()`, and `_prep_q_or_k()`. Modified `Indexer.__init__` to use `dtype=torch.float32` and `use_custom_cublas_mm=False` in weights_proj creation. Updated `forward()` signature to remove `indexer_weights` parameter and consolidate Q/K projection logic via new helpers. |
| Weight Handling and Flag Rename `tensorrt_llm/_torch/models/modeling_deepseekv3.py` | Renamed fuse flag from `fuse_a_indexer_k_weight` to `fuse_a_indexer_k`. Adjusted dimension calculation in `kv_a_proj_with_mqa` construction to use only `indexer.head_dim` (removed `+ indexer.n_heads`). Simplified `post_load_weights()` to remove indexer.weights_proj handling and consolidate fused weight loading. |
| MLA Attention Output Signature `tensorrt_llm/_torch/modules/attention.py` | Renamed internal fuse flag from `fuse_a_indexer_k_weight` to `fuse_a_indexer_k`. Modified `kv_a_proj_with_mqa` return signature from 5 values to 4 values (removed `indexer_weights`). Updated `forward_impl_with_dsa` to no longer produce or pass `indexer_weights` to indexer call. |

Sequence Diagram

sequenceDiagram
    participant forward as forward()
    participant float_conv as _to_float()
    participant qk_rope as _qk_projection_and_rope()
    participant prep_q as _prep_q_or_k()
    participant prep_k as _prep_q_or_k()
    participant weight_scale as _weight_scale()
    participant cache as _update_k_cache()

    forward->>float_conv: convert inputs to FP32
    float_conv-->>forward: hidden_states (FP32)
    
    forward->>qk_rope: qr, hidden_states, indexer_k, position_ids
    Note over qk_rope: Q/K projection<br/>K selection<br/>RoPE split & rotation
    qk_rope-->>forward: q_pe, q_nope, k_pe, k_nope
    
    par Prepare Q and K in parallel
        forward->>prep_q: q_pe, q_nope
        Note over prep_q: Concat, activate<br/>reshape, FP8 quantize
        prep_q-->>forward: q_prepared
        
        forward->>prep_k: k_pe, k_nope
        Note over prep_k: Concat, activate<br/>reshape, FP8 quantize
        prep_k-->>forward: k_prepared
    end
    
    forward->>weight_scale: weights, q_scale
    Note over weight_scale: Centralized weight<br/>scaling logic
    weight_scale-->>forward: scaled_weights
    
    forward->>cache: update K cache
    cache-->>forward: ✓ Cache updated

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

- New helper methods in dsa.py: Verify logic consolidation in _qk_projection_and_rope() and _prep_q_or_k() matches original inline behavior, especially RoPE handling and FP8 quantization ordering.
- Dimension changes in modeling_deepseekv3.py: Confirm removal of + indexer.n_heads from kv_a_proj_with_mqa dimension calculation is intentional and correct, and validate weight loading/consolidation logic.
- Signature propagation: Ensure indexer_weights removal is consistently handled across all call sites in both attention.py and modeling_deepseekv3.py.
- dtype changes: Verify that forcing dtype=torch.float32 in weights_proj and disabling use_custom_cublas_mm do not introduce performance or correctness regressions.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly summarizes the main change: converting indexer weight_proj to use FP32 GEMM instead of lower precision, matching the stated objective. |
| Description check | ✅ Passed | The PR description provides a clear summary of the change (promoting weight_proj to FP32), includes detailed accuracy and performance impact data, and references the dependency PR (#9232). |


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

356-369: Clarify kv_a_proj_with_mqa comment now that only indexer.wk is fused

The comment still says kv_a_proj_with_mqa is oversized “to include indexer weights”, but after this PR only indexer.wk is fused; indexer.weights_proj stays separate. Consider rewording to avoid implying the weight_proj GEMM is also fused into this module.

559-561: Fused indexer K path looks consistent; consider guarding fuse flag usage

The new kv_a_proj_with_mqa out_features (kv_lora_rank + qk_rope_head_dim + q_lora_rank + indexer.head_dim) matches the split in forward_impl_with_dsa and the offset used in post_load_weights, so the indexer_k slice and the indexer.wk copy line up correctly.

The dtype assertion before copying indexer.wk into the fused module is a good safety check and aligns with the new FP32-only weights_proj path.

One small robustness tweak: forward_impl_with_dsa in MLA assumes self.fuse_a_indexer_k exists. If any non‑DeepseekV32 MLA instance ever runs with DSA enabled, using getattr(self, "fuse_a_indexer_k", False) (or defaulting the flag in MLA.__init__) would avoid a potential AttributeError.

Also applies to: 584-591, 598-607
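
The suggested guard can be sketched in isolation as follows. The classes are hypothetical stand-ins rather than the real MLA or DeepseekV32 modules; only the `getattr` default pattern is the point.

```python
class MLABase:
    """Hypothetical stand-in for an attention module that never sets the flag."""

    def uses_fused_indexer_k(self):
        # Falls back to False instead of raising AttributeError
        # when the subclass did not define the attribute.
        return getattr(self, "fuse_a_indexer_k", False)


class DeepseekV32Like(MLABase):
    """Hypothetical stand-in for a module that opts into the fused path."""

    def __init__(self):
        self.fuse_a_indexer_k = True


plain = MLABase()
fused = DeepseekV32Like()

print(plain.uses_fused_indexer_k())  # False
print(fused.uses_fused_indexer_k())  # True
```

Defaulting the flag in the base class `__init__` would achieve the same effect and keeps the attribute discoverable.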

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36d3d8f and 578565c.

📒 Files selected for processing (3)

tensorrt_llm/_torch/attention_backend/sparse/dsa.py (3 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (4 hunks)
tensorrt_llm/_torch/modules/attention.py (1 hunks)

🧰 Additional context used

🧠 Learnings (8)

📚 Learning: 2025-08-14T06:36:40.701Z

Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

tensorrt_llm/_torch/models/modeling_deepseekv3.py

📚 Learning: 2025-08-19T12:45:11.997Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/attention_backend/sparse/dsa.py

📚 Learning: 2025-08-15T06:46:53.813Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/attention_backend/sparse/dsa.py

📚 Learning: 2025-08-14T21:04:50.248Z

Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

tensorrt_llm/_torch/modules/attention.py

📚 Learning: 2025-09-29T15:14:28.503Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.

Applied to files:

tensorrt_llm/_torch/modules/attention.py

📚 Learning: 2025-08-19T12:45:35.429Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:2086-2092
Timestamp: 2025-08-19T12:45:35.429Z
Learning: DoRA (Delta Orthogonal Rank Adaptation) functionality has been removed from the PyTorch flow in tensorrt_llm/_torch/pyexecutor/model_engine.py. The is_dora field is computed but not used downstream in the PyTorch flow, so converting it to a tensor would be wasteful overhead.

Applied to files:

tensorrt_llm/_torch/attention_backend/sparse/dsa.py

📚 Learning: 2025-08-14T15:43:23.107Z

Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

tensorrt_llm/_torch/attention_backend/sparse/dsa.py

🧬 Code graph analysis (1)

tensorrt_llm/_torch/attention_backend/sparse/dsa.py (3)

tensorrt_llm/_torch/utils.py (2)

maybe_compile (344-365)

_ (217-223)

tensorrt_llm/quantization/utils/fp8_utils.py (1)

fp8_quantize_1x128_sf_transpose (523-533)

tensorrt_llm/_torch/modules/multi_stream_utils.py (1)

maybe_execute_in_parallel (35-74)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (2)

tensorrt_llm/_torch/modules/attention.py (1)

1220-1236: DSA fused indexer_k split matches fused projection sizing

The new split([self.q_lora_rank, self.kv_lora_rank, self.qk_rope_head_dim, self.indexer.head_dim], -1) for the fused path aligns with the updated kv_a_proj_with_mqa out_features and with the offset used when copying indexer.wk into the fused weights. The fallback path (no fusion) remains unchanged aside from passing indexer_k=None, so behavior for non‑fused configurations is preserved.
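
To make the sizing argument concrete, here is a small sketch of the fused output width and the split order described above. The numeric values are illustrative assumptions, not read from this PR, and the output names (`q`, `compressed_kv`, `k_pe`) are hypothetical labels for the slices.

```python
def split_last_dim(row, sizes):
    # Split a flat row into consecutive chunks of the given sizes,
    # mirroring a split along the last dimension.
    assert sum(sizes) == len(row), "split sizes must cover the whole row"
    chunks, start = [], 0
    for size in sizes:
        chunks.append(row[start:start + size])
        start += size
    return chunks


# Illustrative dimensions (assumed for this sketch).
q_lora_rank = 1536
kv_lora_rank = 512
qk_rope_head_dim = 64
indexer_head_dim = 128

# Fused projection width: no "+ indexer.n_heads" term after this PR.
out_features = kv_lora_rank + qk_rope_head_dim + q_lora_rank + indexer_head_dim

row = list(range(out_features))  # one token's fused projection output
q, compressed_kv, k_pe, indexer_k = split_last_dim(
    row, [q_lora_rank, kv_lora_rank, qk_rope_head_dim, indexer_head_dim]
)

# Offset at which the indexer K slice starts inside the fused output.
indexer_k_offset = q_lora_rank + kv_lora_rank + qk_rope_head_dim
```

The check that matters is that the last slice starts exactly at the offset used when the indexer K weights are copied into the fused module.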

tensorrt_llm/_torch/attention_backend/sparse/dsa.py (1)

677-680: FP32 weights_proj path and Q/K helper refactor look consistent

This refactor cleanly separates concerns:

weights_proj is now explicitly FP32, unquantized, and uses standard GEMM (use_custom_cublas_mm=False), with _to_float ensuring FP32 inputs. That matches the goal of promoting the indexer’s weight projection to FP32.

_qk_projection_and_rope and _prep_q_or_k centralize Q/K projection, normalization, RoPE, rotation, and FP8 quantization, avoiding duplicate logic and reusing fused indexer_k when provided.

The new forward flow keeps shapes coherent:

Q path: [T, n_heads, head_dim] → FP8 + per‑token scales → reshape to [T, n_heads, head_dim] and [T, n_heads, 1].

K path: [T, head_dim] → FP8 + scales as [T, head_dim] and [T, 1], matching _update_k_cache’s expectations.

_weight_scale applies the same scaling factor as before, now directly to the FP32 weights_proj output using the q-scale tensor.

Functionally this aligns with the existing sparse indexer pipeline while swapping the weight GEMM to FP32. Please double‑check that:

fp8_mqa_logits / fp8_paged_mqa_logits are happy with weights being float32 (not bf16), and

tests cover both fused (indexer_k provided) and unfused (indexer_k=None) Indexer paths after this change.
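
For orientation, the weight scaling step can be pictured roughly as below. This is a sketch under assumptions: the constant factor (`n_heads ** -0.5 * softmax_scale`) and the function signature are hypothetical and not taken from this PR; only the shapes ([T, n_heads] weights combined with [T, n_heads, 1] Q scales) follow the description above.

```python
def weight_scale(weights, q_scale, n_heads, softmax_scale):
    # weights: [T][n_heads] output of the FP32 weight projection.
    # q_scale: [T][n_heads][1] per-token, per-head FP8 scales for Q.
    # The constant factor below is an assumption for illustration.
    factor = n_heads ** -0.5 * softmax_scale
    return [
        [w * s[0] * factor for w, s in zip(w_row, s_row)]
        for w_row, s_row in zip(weights, q_scale)
    ]


n_heads = 4
weights = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]  # T = 2 tokens
q_scale = [[[2.0] for _ in range(n_heads)] for _ in range(2)]

scaled = weight_scale(weights, q_scale, n_heads=n_heads, softmax_scale=0.5)
```

Because the inputs stay in floating point throughout, nothing here depends on the quantized format of Q itself; only its scales feed into the weights.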

Also applies to: 720-727, 1242-1272, 1273-1312

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

chang-l · 2025-11-19T00:06:44Z

/bot run

tensorrt-cicd · 2025-11-19T00:12:46Z

PR_Github #24951 [ run ] triggered by Bot. Commit: b457bb6

tensorrt-cicd · 2025-11-19T01:52:07Z

PR_Github #24951 [ run ] completed with state SUCCESS. Commit: b457bb6
/LLM/main/L0_MergeRequest_PR pipeline #18849 completed with status: 'FAILURE'

tensorrt_llm/_torch/models/modeling_deepseekv3.py

tensorrt_llm/_torch/modules/attention.py

tensorrt_llm/_torch/attention_backend/sparse/dsa.py

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

chang-l · 2025-11-19T19:26:25Z

/bot run

tensorrt-cicd · 2025-11-19T19:32:25Z

PR_Github #25084 [ run ] triggered by Bot. Commit: e086160

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

tensorrt-cicd · 2025-11-20T03:56:35Z

PR_Github #25084 [ run ] completed with state SUCCESS. Commit: e086160
/LLM/main/L0_MergeRequest_PR pipeline #18962 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

chang-l · 2025-11-20T05:16:34Z

/bot reuse-pipeline

tensorrt-cicd · 2025-11-20T05:21:54Z

PR_Github #25147 [ reuse-pipeline ] triggered by Bot. Commit: 2df9023

tensorrt-cicd · 2025-11-20T05:52:36Z

PR_Github #25147 [ reuse-pipeline ] completed with state SUCCESS. Commit: 2df9023
Reusing PR_Github #25084 for commit 2df9023

chang-l force-pushed the indexer_weight_fp32 branch from eeadd4b to e6693d5 Compare November 18, 2025 02:42

chang-l changed the title ~~[None][fix] Use fp32 for indexer weight_proj GEMM~~ [None][fix] Use fp32 for indexer weight_proj GEMM Nov 18, 2025

lfr-0531 reviewed Nov 18, 2025

tensorrt_llm/_torch/models/modeling_deepseekv3.py

chang-l added 3 commits November 18, 2025 09:15

Unfuse weight_proj and promote to fp32

f5f322d

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

Use torch compile cpy + tensorcore GEMM (use_custom_cublas_mm=False)

56cc10d

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

Resolve branch conflict

e9489cf

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

chang-l force-pushed the indexer_weight_fp32 branch from 118120e to e9489cf Compare November 18, 2025 17:15

chang-l added 2 commits November 18, 2025 13:52

Refactor to optimize stream/op placement

d31b650

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

Unify fp8/nvfp4 code path

578565c

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

chang-l marked this pull request as ready for review November 18, 2025 22:12

chang-l requested review from a team as code owners November 18, 2025 22:12

chang-l requested review from PerkzZheng, dongxuy04 and yechank-nvidia November 18, 2025 22:13

coderabbitai bot reviewed Nov 18, 2025

Fix scale loading for fp8

b457bb6

Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>

chang-l requested a review from yuxianq November 18, 2025 23:55