## Bug Description

Comparing the performance of Torch-TRT against ONNX-TRT in fp16:

1. Skipping constant folding of embedding layers does not affect engine size, latency, or precision.
2. Disabling the linear decomposition and adding a linear converter reduces latency by ~15%.
3. opt_level=3 and opt_level=5 yield almost the same latency.
4. ONNX-TRT takes much longer to compile.
5. Torch-TRT is ~9% slower than ONNX-TRT (latency measured with a simple timing loop; see the sketch below).
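For context on how latency numbers like these can be gathered, here is a minimal timing sketch using CUDA events. The TinyMLP model, input shape, and iteration counts are hypothetical placeholders rather than the actual benchmark, and it assumes a Torch-TensorRT build whose dynamo frontend accepts the `optimization_level` kwarg (the builder knob referred to as opt_level in point 3).

```python
# A minimal timing sketch, not the original benchmark: TinyMLP, the input
# shape, and iteration counts are illustrative placeholders.
import torch
import torch_tensorrt


class TinyMLP(torch.nn.Module):
    """Hypothetical stand-in for the real model under test."""

    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768, 768)

    def forward(self, x):
        return torch.relu(self.fc(x))


model = TinyMLP().half().cuda().eval()
example_input = torch.randn(8, 128, 768, dtype=torch.half, device="cuda")

# Torch-TRT path. (The ONNX-TRT path would instead export the model with
# torch.onnx.export and build the engine with `trtexec --fp16`.)
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[example_input],
    enabled_precisions={torch.half},
    optimization_level=5,  # compared against 3 in point 3 above
)


def measure_latency(mod, inp, warmup=10, iters=100):
    """Average per-iteration latency in ms, measured with CUDA events."""
    with torch.no_grad():
        for _ in range(warmup):
            mod(inp)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            mod(inp)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


print(f"Torch-TRT latency: {measure_latency(trt_model, example_input):.3f} ms")
```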