
Conversation

2ez4bz (Collaborator) commented Nov 17, 2025

Summary by CodeRabbit

  • New Features

    • Added public API to register custom model implementations via register_custom_model_cls() method for flexible model extensibility.
    • Introduced support for NemotronH model with Mamba-inspired architecture, including specialized mixer, attention, and normalization components.
  • Tests

    • Added test coverage for custom model registration and building workflows.


Description

Why?

The reference Nemotron H code on HuggingFace is out of date, and therefore buggy, and it has several untested code paths. This makes an already hairy patching system even hairier.

The proposal is to do away with those patches and replace the original implementation with a heavily slimmed-down one.

What?

This PR lays the groundwork for an alternative path via such a slimmed-down implementation, one that:

  • fixes bugs in the current HF implementation
  • adds no new dependencies to TensorRT-LLM
  • does away with unnecessary features for TensorRT-LLM / AutoDeploy:
    • no training-related code (dropout, gradient checkpointing, etc.)
    • no caching business (we want to replace it with our own anyway)
    • no attention masking where possible
  • reuses existing AD custom ops for mamba SSM update / causal conv1d

To make the above usable in the AD apparatus, its LlmArgs and AutoModelForCausalLMFactory are extended, as sketched below.
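
For illustration, registering the new implementation might look like this (a sketch based on the walkthrough further down; the exact signature lives in tensorrt_llm/_torch/auto_deploy/models/hf.py):

    from tensorrt_llm._torch.auto_deploy.models.hf import AutoModelForCausalLMFactory
    from tensorrt_llm._torch.auto_deploy.models.modeling_nemotron_h import NemotronHForCausalLM

    # Route configs whose model_type is "nemotron_h" to the slimmed-down
    # implementation instead of the stock HF automodel path.
    AutoModelForCausalLMFactory.register_custom_model_cls("nemotron_h", NemotronHForCausalLM)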

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@2ez4bz 2ez4bz marked this pull request as ready for review November 19, 2025 22:01
@2ez4bz 2ez4bz requested a review from a team as a code owner November 19, 2025 22:01
coderabbitai bot (Contributor) commented Nov 19, 2025

📝 Walkthrough

The changes introduce a pluggable custom model registration system for AutoModelForCausalLMFactory and implement a complete PyTorch NemotronH model with custom Mamba-inspired mixers, attention, and feed-forward components. Tests validate the new registration mechanism, and test utilities are updated for consistent input handling.

Changes

  • Custom Model Registration System (tensorrt_llm/_torch/auto_deploy/models/hf.py):
    Added a _custom_model_mapping dictionary and a register_custom_model_cls() class method to AutoModelForCausalLMFactory and AutoModelForImageTextToTextFactory. Modified _build_model() to check the custom mapping before falling back to the standard automodel_cls.from_config() flow (see the sketch after this list).
  • NemotronH Model Implementation (tensorrt_llm/_torch/auto_deploy/models/modeling_nemotron_h.py):
    New module implementing NemotronHForCausalLM, NemotronHModel, and supporting components, including NemotronHMamba2Mixer (state-space mixer), NemotronHAttention, NemotronHMLP, NemotronHBlock, NemotronHPreTrainedModel, and custom output dataclasses. Includes RMSNorm, weight initialization, and generation support. Registers with the factory as "nemotron_h".
  • Custom Model Registration Tests (tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hf.py):
    Added a fixture to restore factory state and new test cases: test_register_custom_model_cls(), test_build_model_uses_custom_model_cls_from_config(), and test_build_model_uses_custom_model_cls_directly(). Added imports for mocking and error validation.
  • Test Input Handling Updates (tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py):
    Refactored test parameterization from pytest.param to plain tuples and updated input preparation to explicitly extract input_ids before passing to model.generate(). Reformatted function signatures and updated helper comments.
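
A rough sketch of that lookup-then-fallback dispatch, for orientation (a hypothetical shape following the walkthrough; the actual _build_model() may differ in detail):

    # Hypothetical sketch of the dispatch in _build_model().
    model_type = getattr(model_config, "model_type", None)
    custom_cls = cls._custom_model_mapping.get(model_type)
    if custom_cls is not None:
        # Prefer the HF _from_config convention when the class provides it.
        if hasattr(custom_cls, "_from_config"):
            model = custom_cls._from_config(model_config)
        else:
            model = custom_cls(model_config)
    else:
        # No custom model registered: standard automodel flow.
        model = cls.automodel_cls.from_config(model_config)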

Sequence Diagrams

sequenceDiagram
    participant Client
    participant Factory as AutoModelForCausalLMFactory
    participant Registry as _custom_model_mapping
    participant CustomModel
    participant StdAutoModel

    Client->>Factory: register_custom_model_cls("nemotron_h", CustomModel)
    Factory->>Registry: Store mapping
    
    Client->>Factory: build_model(model_config)
    Factory->>Factory: Extract model_type from config
    Factory->>Registry: Lookup model_type
    Registry-->>Factory: CustomModel found
    Factory->>CustomModel: _from_config or __init__
    CustomModel-->>Factory: Instance created
    Factory-->>Client: Return model instance
sequenceDiagram
    participant Client
    participant Factory as AutoModelForCausalLMFactory
    participant Registry as _custom_model_mapping
    participant StdAutoModel
    
    Client->>Factory: build_model(model_config)
    Factory->>Factory: Extract model_type from config
    Factory->>Registry: Lookup model_type
    Registry-->>Factory: No custom model registered
    Factory->>StdAutoModel: from_config(model_config)
    StdAutoModel-->>Factory: Instance created
    Factory-->>Client: Return model instance

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • modeling_nemotron_h.py — Largest addition with 10+ classes requiring careful review of mixer logic, attention implementation, weight initialization, and HuggingFace integration patterns
  • hf.py — Straightforward addition of dictionary and two simple class methods; low complexity
  • test_hf.py — New tests follow standard mocking patterns; verify registration and dual code paths (_from_config and direct construction)
  • test_hybrid_patches.py — Mechanical refactoring of test inputs; verify parameter updates don't alter test intent

Areas requiring extra attention:

  • NemotronHMamba2Mixer forward pass and state-space computations; verify A, B, C, D transformations align with design intent
  • Weight initialization in NemotronHPreTrainedModel, particularly mamba-specific A_log and dt handling
  • Custom model class instantiation in _build_model() — ensure both _from_config and direct constructor paths are properly tested
  • Input handling changes in test_hybrid_patches.py — confirm tokenizer output extraction doesn't break existing test assertions
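
For reviewers checking that mixer math, the selective-SSM recurrence these parameters implement is, schematically (a generic single-group sketch of the Mamba-2 style update, not code lifted from the PR):

    import torch

    def ssm_step(h, x_t, A, B_t, C_t, D, dt_t):
        # h: (nheads, head_dim, d_state) running state
        # x_t: (nheads, head_dim) input at step t
        # A, dt_t, D: (nheads,); A is negative, typically -exp(A_log)
        # B_t, C_t: (d_state,) per-step projections (single group assumed)
        dA = torch.exp(dt_t * A)                           # zero-order-hold decay
        dBx = torch.einsum("h,hd,n->hdn", dt_t, x_t, B_t)  # dt-scaled input write
        h = h * dA[:, None, None] + dBx                    # state update
        y = torch.einsum("hdn,n->hd", h, C_t) + D[:, None] * x_t  # readout + skip
        return h, y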

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 17.50%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title clearly identifies the main change, introducing a slimmed-down implementation of the Nemotron H model, which aligns with the PR's core objective.
  • Description check ✅ Passed: The PR description clearly explains why the change is needed (an outdated HF implementation), what is being changed (a slimmed-down implementation), and the key benefits (bug fixes, no new dependencies, removal of unnecessary features).

coderabbitai bot (Contributor) left a comment:

Actionable comments posted: 3

📜 Review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4abb86 and 2b59298.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/modeling_nemotron_h.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hf.py (3 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/models/modeling_nemotron_h.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/models/modeling_nemotron_h.py
📚 Learning: 2025-08-27T16:22:10.695Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 7227
File: tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py:94-100
Timestamp: 2025-08-27T16:22:10.695Z
Learning: When there are inconsistent operator detection methods (like custom_op() vs target_op()), removing one method and standardizing on the other is often cleaner than supporting both methods simultaneously.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/models/modeling_nemotron_h.py
📚 Learning: 2025-09-03T13:16:38.028Z
Learnt from: nvpohanh
Repo: NVIDIA/TensorRT-LLM PR: 7478
File: tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py:286-308
Timestamp: 2025-09-03T13:16:38.028Z
Learning: In test files, temporary monkey-patches for upstream bugs can be kept simple when they are explicitly intended to be removed soon, rather than investing effort in making them more robust.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/models/modeling_nemotron_h.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
tensorrt_llm/bench/benchmark/__init__.py (1)
  • model_type (71-72)
tensorrt_llm/_torch/auto_deploy/models/modeling_nemotron_h.py (3)
tensorrt_llm/functional.py (3)
  • sqrt (450-454)
  • conv1d (3548-3585)
  • arange (1498-1569)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • register_custom_model_cls (494-508)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
  • torch_attention (96-212)
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hf.py (1)
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
  • AutoModelForCausalLMFactory (100-508)
  • register_custom_model_cls (494-508)
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
  • to (492-499)
  • device (217-218)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • to (42-46)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/auto_deploy/models/hf.py
  • 115: Mutable class attributes should be annotated with typing.ClassVar (RUF012)

tensorrt_llm/_torch/auto_deploy/models/modeling_nemotron_h.py
  • 262: Avoid specifying long messages outside the exception class (TRY003)
  • 314: Avoid specifying long messages outside the exception class (TRY003)
  • 381: Mutable class attributes should be annotated with typing.ClassVar (RUF012)
  • 470: Unused method argument: prefix (ARG002)
  • 470: Unused method argument: args (ARG002)
  • 486: Unused method argument: kwargs (ARG002)
  • 489: Avoid specifying long messages outside the exception class (TRY003)
  • 505: Mutable class attributes should be annotated with typing.ClassVar (RUF012)
  • 532: Unused method argument: position_ids (ARG002)
  • 533: Unused method argument: kwargs (ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (16)
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hybrid_patches.py (3)

19-21: LGTM! Enhanced test coverage.

Adding test cases for both verification modes provides better coverage of the generation path.


69-71: LGTM! Clear documentation of constraint.

The updated comment clearly explains the tokenization length requirement and the workaround being used.


175-178: LGTM! More explicit input handling.

The explicit extraction of input_ids before moving to device improves code clarity and consistency with the rest of the file.

tensorrt_llm/_torch/auto_deploy/models/hf.py (2)

212-229: LGTM! Clean custom model routing implementation.

The logic correctly:

  • Checks for a custom model class based on model_type
  • Prefers _from_config (HF convention) when available
  • Falls back to the default behavior when no custom model is registered

493-508: LGTM! Clean registration API.

The registration method provides a simple, well-documented interface for custom model routing.

tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_hf.py (3)

23-27: LGTM! Proper test isolation.

The fixture correctly preserves and restores the custom model mapping between tests, ensuring test independence.
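
A fixture of roughly this shape achieves that isolation (a sketch; the actual fixture in test_hf.py may differ in detail):

    import pytest
    from tensorrt_llm._torch.auto_deploy.models.hf import AutoModelForCausalLMFactory

    @pytest.fixture(autouse=True)
    def restore_custom_model_mapping():
        # Snapshot the registry so registrations made inside a test
        # cannot leak into later tests.
        saved = dict(AutoModelForCausalLMFactory._custom_model_mapping)
        yield
        AutoModelForCausalLMFactory._custom_model_mapping.clear()
        AutoModelForCausalLMFactory._custom_model_mapping.update(saved)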


137-144: LGTM! Clear test of registration functionality.

The test verifies that register_custom_model_cls correctly stores the custom model class in the mapping.


151-185: LGTM! Comprehensive test coverage of custom model paths.

Both tests effectively verify:

  • Custom model class is invoked based on model_type
  • Both construction paths (_from_config and direct) work correctly
  • Error propagation from custom model classes is preserved
tensorrt_llm/_torch/auto_deploy/models/modeling_nemotron_h.py (8)

38-86: LGTM! Self-contained RMS norm implementation.

The implementation correctly replaces the mamba_ssm dependency with a PyTorch + einops version, handling grouped gated RMS normalization appropriately.
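
In the ungrouped case, gated RMS norm of this kind reduces to something like the following (a generic sketch assuming the mamba_ssm gate-then-normalize convention; the PR's einops-based version additionally handles the grouped case):

    import torch
    import torch.nn.functional as F

    def gated_rms_norm(x, gate, weight, eps=1e-5):
        dtype = x.dtype
        x = x.to(torch.float32)
        if gate is not None:
            # Gate first (SiLU), then normalize over the last dim.
            x = x * F.silu(gate.to(torch.float32))
        var = x.pow(2).mean(-1, keepdim=True)
        return (x * torch.rsqrt(var + eps) * weight).to(dtype)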


158-224: LGTM! Proper custom ops integration.

The Mamba2Mixer correctly integrates torch_causal_conv1d and torch_ssm custom ops, with appropriate dtype handling throughout.


245-272: LGTM! Correct block implementation.

The block correctly switches between mixer types (mamba, attention, mlp) and applies standard pre-norm with residual connections.
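
In pre-norm form, each block's forward then reduces to roughly the following (a sketch; mixer stands in for whichever of the three variants the layer uses):

    def block_forward(self, hidden_states):
        # Standard pre-norm residual: norm -> mixer -> add.
        residual = hidden_states
        hidden_states = self.norm(hidden_states)
        hidden_states = self.mixer(hidden_states)  # Mamba2 mixer, attention, or MLP
        return residual + hidden_states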


306-368: LGTM! Correct attention implementation.

The attention module properly uses the torch_attention custom op with bsnd layout, matching the expected interface.
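
For intuition, a bsnd (batch, seq, num_heads, head_dim) layout matches PyTorch SDPA semantics after a transpose; a rough equivalence sketch (the PR calls the AD custom op, not SDPA directly):

    import torch.nn.functional as F

    def bsnd_attention(q, k, v):  # each: (batch, seq, num_heads, head_dim)
        # SDPA expects (batch, heads, seq, head_dim), hence the transposes.
        out = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
        )
        return out.transpose(1, 2)  # back to bsnd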


385-428: LGTM! Proper weight initialization.

The initialization logic correctly handles Mamba-specific parameters (A_log, D, dt_bias) and follows the GPT-2 rescaling scheme for prenorm residual connections.
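
For reference, the customary Mamba-style initialization of those parameters looks like this (a sketch of the convention only, not copied from the PR; names and ranges are assumptions):

    import math
    import torch

    def init_mamba_params(nheads, dt_min=1e-3, dt_max=1e-1, dt_floor=1e-4):
        # A_log: log of a positive per-head decay; the forward pass uses A = -exp(A_log).
        A_log = torch.log(torch.arange(1, nheads + 1, dtype=torch.float32))
        D = torch.ones(nheads)  # skip-connection scale
        # dt_bias: inverse-softplus of a log-uniform sample in [dt_min, dt_max],
        # so that softplus(dt_bias) lands back in that range.
        dt = torch.exp(
            torch.rand(nheads) * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min)
        ).clamp(min=dt_floor)
        dt_bias = dt + torch.log(-torch.expm1(-dt))  # inverse softplus
        return A_log, D, dt_bias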


470-501: LGTM! Correct model implementation.

The model properly:

  • Handles legacy checkpoint compatibility via load_hook
  • Validates input arguments (exactly one of input_ids or inputs_embeds)
  • Applies standard transformer flow: embeddings → layers → final norm

Note: The unused prefix and args parameters in load_hook are part of the PyTorch hook signature and cannot be removed.
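
That embeddings → layers → final norm flow, in schematic form (a sketch; attribute names such as embeddings, layers, and norm_f are assumptions):

    def model_forward(self, input_ids=None, inputs_embeds=None):
        # Exactly one of input_ids / inputs_embeds must be provided.
        if (input_ids is None) == (inputs_embeds is None):
            raise ValueError("Specify exactly one of input_ids or inputs_embeds")
        if inputs_embeds is None:
            inputs_embeds = self.embeddings(input_ids)
        hidden_states = inputs_embeds
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return self.norm_f(hidden_states)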


516-540: LGTM! Correct causal LM implementation.

The implementation properly:

  • Delegates embedding access to the backbone
  • Applies dtype conversion for the LM head
  • Provides the interface needed for GenerationMixin

Note: The unused position_ids and kwargs parameters are required for compatibility with HuggingFace's generation interface and cannot be removed.


543-543: LGTM! Proper model registration.

The registration correctly wires NemotronHForCausalLM into the custom model factory system for the "nemotron_h" model type.

2ez4bz (Collaborator, Author) commented Nov 20, 2025

/bot run

tensorrt-cicd (Collaborator) commented:

PR_Github #25148 [ run ] triggered by Bot. Commit: 2b59298

tensorrt-cicd (Collaborator) commented:

PR_Github #25148 [ run ] completed with state SUCCESS. Commit: 2b59298
/LLM/main/L0_MergeRequest_PR pipeline #19014 completed with status: 'FAILURE'

2ez4bz (Collaborator, Author) commented Nov 21, 2025

/bot run

tensorrt-cicd (Collaborator) commented:

PR_Github #25321 [ run ] triggered by Bot. Commit: 77ccb30

tensorrt-cicd (Collaborator) commented:

PR_Github #25321 [ run ] completed with state SUCCESS. Commit: 77ccb30
/LLM/main/L0_MergeRequest_PR pipeline #19155 completed with status: 'FAILURE'

govind-ramnarayan (Collaborator) left a comment:

LGTM, some small nits.

2ez4bz (Collaborator, Author) commented Nov 22, 2025

/bot run

@2ez4bz 2ez4bz enabled auto-merge (squash) November 22, 2025 05:57
tensorrt-cicd (Collaborator) commented:

PR_Github #25412 [ run ] triggered by Bot. Commit: aff6752

tensorrt-cicd (Collaborator) commented:

PR_Github #25412 [ run ] completed with state SUCCESS. Commit: aff6752
/LLM/main/L0_MergeRequest_PR pipeline #19228 completed with status: 'FAILURE'

2ez4bz (Collaborator, Author) commented Nov 22, 2025

/bot run

tensorrt-cicd (Collaborator) commented:

PR_Github #25427 [ run ] triggered by Bot. Commit: aff6752
