🐛 Bug
I couldn't find anywhere in the code where the `gradient_clip_algorithm` argument (implemented in #6123) is passed to the `Accelerator.clip_gradients` method, and I suspected that the default algorithm (`GradClipAlgorithmType.NORM`) is always used no matter what is set.
After a brief investigation, I believe I've confirmed that this is the case and that the original test case could not detect it.
I'm not sure how to properly fix this bug yet, but I would like to warn other users that only clipping by norm works at the moment.
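For context, here is a minimal sketch of the call path as I understand it. This is a paraphrase, not the exact source: the import location and the method signature are assumptions on my part.

```python
from pytorch_lightning.utilities import GradClipAlgorithmType  # assumed import path

# Paraphrased sketch of the suspected call chain, not the exact source.
class Accelerator:
    def clip_gradients(self, optimizer, clip_val,
                       gradient_clip_algorithm=GradClipAlgorithmType.NORM):
        # The training loop appears to call clip_gradients(optimizer, clip_val)
        # without forwarding the user's gradient_clip_algorithm, so the
        # default (NORM) always reaches the precision plugin.
        self.precision_plugin.clip_gradients(
            optimizer, clip_val, gradient_clip_algorithm=gradient_clip_algorithm
        )
```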
To Reproduce
This commit first disables the suppression of `AssertionError` in `Trainer.run_train`, and then tests whether the maximum gradient value is almost exactly the configured 1e-5 threshold.
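The added check is roughly of the following shape (my paraphrase of the commit; `model` and `grad_clip_val` stand in for the test's own variables):

```python
import torch

def assert_clipped_by_value(model, grad_clip_val):
    # Clipping by value clamps every gradient entry to
    # [-grad_clip_val, grad_clip_val], so after backward() the max |grad|
    # should sit exactly at the threshold. Clipping by norm only rescales
    # the gradients, which leaves a much smaller max, as in the failure below.
    grad_max = torch.max(torch.stack(
        [p.grad.detach().abs().max() for p in model.parameters()]
    ))
    assert abs(grad_max.item() - grad_clip_val) < 1e-11, \
        f"Gradient max value {grad_max} != grad_clip_val {grad_clip_val} ."
```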
I ran the command

```
pytest tests/trainer/test_trainer.py -k "test_gradient_clipping_by_value and not test_gradient_clipping_by_value_fp16"
```

and got this:

```
FAILED tests/trainer/test_trainer.py::test_gradient_clipping_by_value - AssertionError: Gradient max value 3.6332883155409945e-06 != grad_clip_val 1e-05 .
```
If we change the default algorithm in `PrecisionPlugin.clip_gradients` to `GradClipAlgorithmType.VALUE`, this test case passes.
Alternatively, we can directly assert inside `PrecisionPlugin.clip_gradients` that the clip algorithm is by value. We then get the following error:

```
FAILED tests/trainer/test_trainer.py::test_gradient_clipping_by_value - AssertionError: GradClipAlgorithmType.NORM
```
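Both probes are one-line changes in `PrecisionPlugin.clip_gradients`. Schematically (a sketch assuming a signature along these lines, not the exact source):

```python
from pytorch_lightning.utilities import GradClipAlgorithmType  # assumed import path

class PrecisionPlugin:
    def clip_gradients(self, optimizer, clip_val,
                       gradient_clip_algorithm=GradClipAlgorithmType.NORM):
        # Probe 1: flip the default above to GradClipAlgorithmType.VALUE
        # and test_gradient_clipping_by_value suddenly passes.
        # Probe 2: assert on the value that actually arrives; it is always
        # NORM, which produces the AssertionError shown above.
        assert gradient_clip_algorithm == GradClipAlgorithmType.VALUE, \
            gradient_clip_algorithm
```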
By now we can clearly see that:
- Setting `gradient_clip_algorithm` changes nothing in the training procedure
- The original test case cannot distinguish between the two clipping algorithms
- The `AssertionError` in the original test case will be ignored anyway because of the design of `Trainer.run_train`. (I'm not entirely sure of this one because I'm not familiar with the test environment setup, but it clearly behaves this way in my local environment.)
Environment
- CUDA:
  - GPU:
    - GeForce RTX 2070
  - available: True
  - version: 11.0
- Packages:
  - numpy: 1.19.2
  - pyTorch_debug: False
  - pyTorch_version: 1.7.1
  - pytorch-lightning: 1.3.0rc0
  - tqdm: 4.49.0
- System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor: x86_64
  - python: 3.7.9
  - version: #78-Ubuntu SMP Fri Mar 19 13:29:52 UTC 2021