
🐛 [Bug] perf gap reduce on BERT #3702

@zewenli98


Bug Description

Comparing the performance of Torch-TensorRT (Torch-TRT) against ONNX-TRT on BERT.

In fp32:

  1. Skipping constant folding of embedding layers reduces engine size; it does not affect latency or precision.
  2. Disabling linear decomposition and adding a linear converter does not affect latency.
  3. opt_level=3 and opt_level=5 yield almost the same latency.
  4. ONNX-TRT takes much longer to compile.
  5. Torch-TRT is ~2.5% slower than ONNX-TRT.

In fp16:

  1. Skipping constant folding of embedding layers reduces engine size; it does not affect latency or precision.
  2. Disabling linear decomposition and adding a linear converter reduces latency by ~18% (see the compile sketch after this list).
  3. opt_level=3 and opt_level=5 yield almost the same latency.
  4. ONNX-TRT takes much longer to compile.
  5. Torch-TRT is ~11% slower than ONNX-TRT.
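
For context, here is a minimal sketch of compiling a HuggingFace BERT model with Torch-TRT in fp16. The model name (`bert-base-uncased`), input shapes, and compile settings are assumptions for illustration, not the exact configuration behind the measurements above.

```python
import torch
import torch_tensorrt
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased").eval().cuda()

# Assumed example inputs: batch size 1, sequence length 128.
input_ids = torch.randint(0, 30522, (1, 128), dtype=torch.int64).cuda()
attention_mask = torch.ones((1, 128), dtype=torch.int64).cuda()

# Compile through the dynamo frontend with fp16 kernels enabled.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[input_ids, attention_mask],
    enabled_precisions={torch.half},
)

with torch.no_grad():
    output = trt_model(input_ids, attention_mask)
```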

To Reproduce

Run the perf_run.py script (a rough timing sketch follows).
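
As a rough stand-in for the script's timing loop, the sketch below shows one way latency could be measured with CUDA events. The `measure_latency` helper and the warmup/iteration counts are assumptions, not perf_run.py's actual methodology.

```python
import torch

def measure_latency(module, args, warmup=50, iters=200):
    """Return average latency in milliseconds per forward pass."""
    with torch.no_grad():
        # Warm up so autotuning and caching do not skew the measurement.
        for _ in range(warmup):
            module(*args)
        torch.cuda.synchronize()

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            module(*args)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```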
