✨[Feature] Resource aware Graph partitioner #3906

@narendasan

Description

Problem

Large models often fail or stall during TensorRT engine building because compilation can consume up to 5x the model's size in host CPU memory, exceeding the available RAM. When this happens, compilation either freezes or is OOM-killed by the OS. Even with some optimizations, a big model sometimes still cannot fit into limited CPU RAM.

Solution

We insert a resource-aware graph partitioning pass after capability-based partitioning. It refines the capability-based split by further dividing oversized accelerated subgraphs so that each resulting TRT engine fits within a conservative CPU memory budget. The pass should:

  • Reconstruct accelerated/non-accelerated subgraphs on the original torch.fx.GraphModule, preserving fusion groups and graph topological order.

  • Estimate per-subgraph “size” by traversing reachable get_attr weights and summing tensor bytes, deduplicating shared parameters (see the first sketch after this list).

  • Automatically determine the per-subgraph budget from the available CPU memory or a user-defined CPU memory limit.

  • Iteratively split any accelerated subgraph that exceeds the budget by moving nodes from the front into a new subgraph, validating partition correctness, and never breaking fusion groups (see the second sketch after this list).

  • Produce a roughly equal split of parameter size across the whole graph.
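
A minimal sketch of the size estimate and budget heuristic from the bullets above, assuming a torch.fx.GraphModule and psutil for querying free host memory; the function names, the `seen` set, and the 0.25 safety factor are illustrative placeholders, not the actual implementation.

```python
import psutil  # assumed available for querying free host memory
import torch
import torch.fx


def estimate_subgraph_bytes(gm: torch.fx.GraphModule, subgraph_nodes, seen: set) -> int:
    """Sum tensor bytes reachable through the subgraph's get_attr nodes,
    deduplicating parameters already counted for another subgraph."""
    total = 0
    for node in subgraph_nodes:
        if node.op != "get_attr" or node.target in seen:
            continue
        seen.add(node.target)
        attr = gm
        for atom in node.target.split("."):  # resolve dotted get_attr targets
            attr = getattr(attr, atom)
        if isinstance(attr, torch.Tensor):
            total += attr.numel() * attr.element_size()
    return total


def cpu_memory_budget_bytes(user_limit_bytes=None, safety_factor=0.25) -> int:
    """Derive a conservative per-subgraph budget from the available CPU memory
    (or a user-defined limit). The 0.25 safety factor is only illustrative."""
    available = user_limit_bytes if user_limit_bytes is not None else psutil.virtual_memory().available
    return int(available * safety_factor)
```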

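And a minimal sketch of the greedy splitting loop from the last two bullets. Here `size_of`, `same_fusion_group`, and `is_valid_partition` stand in for the pass's own size, fusion-group, and validation bookkeeping; only the front-to-back strategy described above is assumed.

```python
def split_oversized_subgraph(subgraph_nodes, budget_bytes, size_of,
                             same_fusion_group, is_valid_partition):
    """Peel nodes off the front of an oversized accelerated subgraph into a new
    subgraph until the budget is reached, never cutting inside a fusion group."""
    new_part = []
    remainder = list(subgraph_nodes)
    while remainder:
        node = remainder[0]
        over_budget = size_of(new_part + [node]) > budget_bytes
        # Stop once adding the next node would exceed the budget, unless cutting
        # here would separate nodes of the same fusion group; in that case keep
        # pulling the fused nodes across.
        if over_budget and new_part and not same_fusion_group(new_part[-1], node):
            break
        new_part.append(remainder.pop(0))
    if not remainder:
        return [new_part]                      # nothing left to split off
    if not (is_valid_partition(new_part) and is_valid_partition(remainder)):
        return [list(subgraph_nodes)]          # fall back to the original subgraph
    return [new_part, remainder]
```
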
Alternatives

  • We tried splitting solely at certain nodes (e.g., splitting right after the SDPA node), roughly as sketched below. This did not achieve a roughly equal split of parameter sizes. However, we noticed a performance boost for the split graph compared to the original whole graph, which is counterintuitive and worth further investigation.
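
For reference, the node-based alternative amounts to picking split points by operator identity, roughly as below; the function name and the aten SDPA target check are illustrative, not the code that was actually evaluated.

```python
import torch
import torch.fx


def sdpa_split_points(gm: torch.fx.GraphModule):
    """Collect the nodes after which the graph would be cut when splitting
    solely on SDPA boundaries."""
    return [
        node
        for node in gm.graph.nodes
        if node.op == "call_function"
        and node.target is torch.ops.aten.scaled_dot_product_attention.default
    ]
```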
