Bug Description
Comparing the performance of Torch-TRT against ONNX-TRT.
In fp32:
- Skipping constant folding of embedding layers reduces engine size; it does not affect latency or precision.
- Disabling the linear decomposition and adding a linear converter does not affect latency.
- opt_level=3 and opt_level=5 yield almost the same latency (see the compile sketch after this list).
- ONNX-TRT takes much longer to compile.
- Torch-TRT is ~2.5% slower than ONNX-TRT.
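A minimal sketch of what the Torch-TRT side of the fp32 run might look like, assuming a recent Torch-TensorRT release where the dynamo frontend exposes `optimization_level` (TensorRT's builder optimization level, the knob compared above). `ToyModel`, its sizes, and the input shape are hypothetical stand-ins; the issue does not name the benchmarked network:

```python
import torch
import torch.nn as nn
import torch_tensorrt

# Toy stand-in for the benchmarked network: an embedding feeding a
# linear layer, the two layer types discussed above (hypothetical;
# the issue does not name the actual model).
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(30522, 768)
        self.linear = nn.Linear(768, 768)

    def forward(self, ids):
        return self.linear(self.emb(ids))

model = ToyModel().eval().cuda()
ids = torch.randint(0, 30522, (1, 128), device="cuda")

# optimization_level maps to TensorRT's builder optimization level;
# the report compares levels 3 and 5.
trt_fp32 = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[ids],
    enabled_precisions={torch.float32},
    optimization_level=5,
)
```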
In fp16:
- Skipping constant folding of embedding layers reduces engine size; it does not affect latency or precision.
- Disabling the linear decomposition and adding a linear converter reduces latency by ~18%.
- opt_level=3 and opt_level=5 yield almost the same latency.
- ONNX-TRT takes much longer to compile.
- Torch-TRT is ~11% slower than ONNX-TRT (see the fp16 timing sketch after this list).
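And an fp16 variant with a simple synchronized timing loop, reusing `model` and `ids` from the fp32 sketch above. This harness is an assumed illustration only; the reported numbers come from perf_run.py:

```python
import time
import torch
import torch_tensorrt

# FP16 variant of the compile above: allow half-precision kernels.
trt_fp16 = torch_tensorrt.compile(
    model,                            # ToyModel instance from the fp32 sketch
    ir="dynamo",
    inputs=[ids],
    enabled_precisions={torch.half},
)

# Warm up, then time synchronized iterations.
for _ in range(50):
    trt_fp16(ids)
torch.cuda.synchronize()

iters = 200
start = time.perf_counter()
for _ in range(iters):
    trt_fp16(ids)
torch.cuda.synchronize()
print(f"mean latency: {(time.perf_counter() - start) / iters * 1e3:.3f} ms")
```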
To Reproduce
Run the perf_run.py script (under tools/perf in the Torch-TensorRT repo).