🐛 Bug
I trained a large model using native amp, but the loss converged very slowly. After a careful check of the backward and optimization code, I found that clip_gradients is executed right after backward, while scaler.unscale_ is only conducted in pre_optimization_step.
According to the PyTorch AMP documentation, the order of clipping and unscaling should be swapped: gradients must be unscaled before they are clipped, otherwise the clipping threshold is applied to the scaled gradients. Currently, gradient_clip_val can lead to a very flat learning curve when used together with native amp.
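For reference, a minimal sketch of the order recommended in the PyTorch AMP docs (model, criterion, optimizer, and dataloader here are placeholders, not Lightning internals):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in dataloader:  # placeholder training loop
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets)

    scaler.scale(loss).backward()   # gradients are still scaled here
    scaler.unscale_(optimizer)      # 1) unscale gradients in place first
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # 2) then clip
    scaler.step(optimizer)          # skips the step if grads contain inf/nan
    scaler.update()
```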
Hoping this can be fixed.