🐛 Bug
I trained a large model using native amp, but the loss converged very slowly. After a careful check of the backward and optimization code, I found that clip_gradients is executed right after backward, while scaler.unscale_ is only conducted in pre_optimization_step.
According to the PyTorch AMP documentation, the order of clipping and unscaling should be swapped: gradients must be unscaled before they are clipped, otherwise the clipping threshold is applied to the scaled gradients. Currently, gradient_clip_val can lead to a very flat learning curve when used together with native amp.
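For reference, a minimal sketch of the order recommended in the PyTorch AMP docs (model, criterion, optimizer, and dataloader here are placeholders, not Lightning internals):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:  # placeholder training loop
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets)

    scaler.scale(loss).backward()   # gradients are still scaled here
    scaler.unscale_(optimizer)      # 1) unscale gradients in place first
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # 2) then clip
    scaler.step(optimizer)          # skips the step if grads contain inf/nan
    scaler.update()
```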
Hoping this can be fixed.