✨[Feature] Basic GPU and CPU memory control workflow #3908

@cehongwang

Description

Problem Description

Torch-TensorRT compilation of large models (such as LLMs and diffusion models) can consume excessive CPU and GPU memory, often leading to freezes, CUDA OOM errors, TensorRT compilation failures, or the operating system killing the process. By default, compilation may use up to 5× the model size in CPU memory and 2× the model size in GPU memory, which is prohibitive in many environments.

Solution

Provide compilation options that reduce redundant copies of the model in CPU and GPU memory. Specifically:

- Enable a CPU memory-trimming mechanism (`export TRIM_CPU_MEMORY=1`).
- Provide CPU offloading (`offload_module_to_cpu=True`) to move the original copy of the model to the CPU, saving GPU memory.
- Provide lazy engine initialization (`lazy_engine_init=True`) to save GPU memory for subsequent subgraph compilations when there are graph breaks.

The table below summarizes each setting; a usage sketch follows it.

| Setting | Effect | Approx. memory ratio |
|---|---|---|
| Default | Baseline behavior | CPU: 5×, GPU: 2× |
| `export TRIM_CPU_MEMORY=1` | Reduces redundant CPU copies | CPU: ~3× |
| `offload_module_to_cpu=False` | Further reduces CPU copies | CPU: ~2× |
| `offload_module_to_cpu=True` | Reduces GPU usage, increases CPU usage | GPU: ~1×, CPU: +1× |
| `lazy_engine_init=True` | Reduces GPU usage when there are multiple subgraphs | Lower GPU memory |
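
A minimal usage sketch, assuming a Torch-TensorRT build whose dynamo path accepts `offload_module_to_cpu` and `lazy_engine_init` through `torch_tensorrt.compile` (the toy model and input shape are placeholders):

```python
import os

# TRIM_CPU_MEMORY is an environment variable, so it can be set here
# before importing torch_tensorrt, or via `export TRIM_CPU_MEMORY=1`.
os.environ["TRIM_CPU_MEMORY"] = "1"

import torch
import torch_tensorrt

# Placeholder model and input; a real workload would be an LLM or
# diffusion model where the memory multipliers actually matter.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).eval().cuda()
inputs = [torch.randn(8, 1024).cuda()]

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    # Move the original module to CPU once TensorRT has consumed its
    # weights, so only ~1x the model size stays on the GPU.
    offload_module_to_cpu=True,
    # Defer engine setup so finished engines do not hold GPU memory
    # while later subgraphs (after graph breaks) are still compiling.
    lazy_engine_init=True,
)
```

Setting the environment variable before the `torch_tensorrt` import mirrors `export TRIM_CPU_MEMORY=1` in the shell, where it applies to the whole process.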

Proper configuration ensures efficient resource use, stable compilation, and predictable performance for large-scale models.
