Trainer(gradient_clip_algorithm='value') has no effect (from #6123) #6920

@ceshine

🐛 Bug

I couldn't find anywhere in the code where the gradient_clip_algorithm argument (implemented in #6123) is passed to the Accelerator.clip_gradients method, and I suspected that the default algorithm (GradClipAlgorithmType.NORM) is always used, no matter what the user sets.

After a brief investigation, I believe I've confirmed that this is the case, and that the original test case couldn't detect it.

I'm not sure how to properly fix this bug yet, but I would like to warn other users that only clipping by norm works at the moment.
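In the meantime, a possible workaround (my own suggestion, not an officially documented one) is to clip by value manually in on_after_backward, which, as far as I can tell, runs after the backward pass and before Lightning's built-in clipping:

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def on_after_backward(self):
        # Workaround: clip by value ourselves. Pair this with
        # Trainer(gradient_clip_val=0) so Lightning's built-in (norm-only)
        # clipping stays disabled and does not run on top of it.
        torch.nn.utils.clip_grad_value_(self.parameters(), clip_value=1e-5)
```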

To Reproduce

This commit first disables the suppression of AssertionError in Trainer.run_train, and then tests whether the maximum gradient value is almost the same as the configured 1e-5 threshold.
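The strengthened check is roughly the following (a paraphrase, not the exact test code; the helper name is mine):

```python
import math

import torch


def assert_clipped_by_value(model: torch.nn.Module, clip_val: float = 1e-5):
    # After clipping by value, the max gradient entry should sit at the clip
    # threshold (assuming at least one raw gradient exceeded it). Clipping by
    # norm rescales the whole gradient vector instead, leaving the max entry
    # well below clip_val, so this check can tell the two algorithms apart.
    grad_max = max(p.grad.detach().abs().max().item()
                   for p in model.parameters() if p.grad is not None)
    assert math.isclose(grad_max, clip_val, rel_tol=1e-2), \
        f"Gradient max value {grad_max} != grad_clip_val {clip_val} ."
```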

I ran the command pytest tests/trainer/test_trainer.py -k "test_gradient_clipping_by_value and not test_gradient_clipping_by_value_fp16" and got this:

FAILED tests/trainer/test_trainer.py::test_gradient_clipping_by_value - AssertionError: Gradient max value 3.6332883155409945e-06 != grad_clip_val 1e-05 .
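A maximum of 3.6e-06 against a 1e-5 threshold is consistent with clipping by norm: the whole gradient vector is rescaled to the threshold norm, so individual entries land below it. The difference between the two algorithms is easy to see with torch's standard clipping primitives, independent of Lightning:

```python
import torch
from torch import nn


def fresh_grads():
    # A tiny layer with gradients populated by one backward pass.
    layer = nn.Linear(8, 1)
    layer(torch.randn(4, 8)).sum().backward()
    return list(layer.parameters())

# Clip by value: every gradient entry is capped at the threshold, so the
# max absolute entry lands at 1e-5 (given that some raw entry exceeded it).
params = fresh_grads()
nn.utils.clip_grad_value_(params, clip_value=1e-5)
print(max(p.grad.abs().max().item() for p in params))  # ~1e-5

# Clip by norm: the whole gradient vector is rescaled so its total norm is
# 1e-5, leaving each individual entry well below the threshold, which is
# the behaviour the failing test observed.
params = fresh_grads()
nn.utils.clip_grad_norm_(params, max_norm=1e-5)
print(max(p.grad.abs().max().item() for p in params))  # well below 1e-5
```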

If we change the default algorithm in PrecisionPlugin.clip_gradients to GradClipAlgorithmType.VALUE, this test case passes.

Alternatively, we can directly assert inside PrecisionPlugin.clip_gradients that the clip algorithm is by value. We then get the following error:

FAILED tests/trainer/test_trainer.py::test_gradient_clipping_by_value - AssertionError: GradClipAlgorithmType.NORM
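Both tweaks, sketched together below. This is written against my reading of the 1.3.0rc0 source; the real clip_gradients signature and import path may differ slightly:

```python
from pytorch_lightning.utilities import GradClipAlgorithmType  # import path as of 1.3.x, I believe


class PrecisionPlugin:  # stand-in for pytorch_lightning.plugins.PrecisionPlugin
    def clip_gradients(self, optimizer, clip_val,
                       gradient_clip_algorithm=GradClipAlgorithmType.NORM):
        # Tweak 1: flipping this default to GradClipAlgorithmType.VALUE makes
        # the by-value test pass, which shows the caller never overrides it.
        # Tweak 2: asserting on the argument surfaces what the caller actually
        # passed: "AssertionError: GradClipAlgorithmType.NORM".
        assert gradient_clip_algorithm == GradClipAlgorithmType.VALUE, \
            gradient_clip_algorithm
        ...
```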

By now we can clearly see that:

  1. Setting gradient_clip_algorithm changes nothing in the training procedure
  2. The original test case cannot distinguish between the two clipping algorithms
  3. The AssertionError raised in the original test case is ignored anyway because of the design of Trainer.run_train. (I'm not entirely sure about this one because I'm not familiar with the test environment setup, but it is certainly what happens in my local environment.)
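Given these three points, the fix presumably amounts to threading the configured algorithm through the call chain alongside the clip value, from the Trainer down to PrecisionPlugin.clip_gradients. A hypothetical sketch of the shape (not a tested patch; the real call sites and signatures may differ):

```python
from pytorch_lightning.utilities import GradClipAlgorithmType  # import path as of 1.3.x, I believe


class TrainingLoopSketch:
    """Stand-in for the part of the loop that currently drops the setting."""

    def __init__(self, accelerator, gradient_clip_val,
                 gradient_clip_algorithm=GradClipAlgorithmType.NORM):
        self.accelerator = accelerator
        self.gradient_clip_val = gradient_clip_val
        # Currently this value appears to be stored but never forwarded; the
        # fix is to carry it into Accelerator.clip_gradients and from there
        # into PrecisionPlugin.clip_gradients.
        self.gradient_clip_algorithm = gradient_clip_algorithm

    def _clip_gradients(self, optimizer):
        self.accelerator.clip_gradients(
            optimizer,
            self.gradient_clip_val,
            gradient_clip_algorithm=self.gradient_clip_algorithm,
        )
```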

Environment

  • CUDA:
    - GPU:
      - GeForce RTX 2070
    - available: True
    - version: 11.0
  • Packages:
    - numpy: 1.19.2
    - pyTorch_debug: False
    - pyTorch_version: 1.7.1
    - pytorch-lightning: 1.3.0rc0
    - tqdm: 4.49.0
  • System:
    - OS: Linux
    - architecture:
      - 64bit
    - processor: x86_64
    - python: 3.7.9
    - version: #78-Ubuntu SMP Fri Mar 19 13:29:52 UTC 2021

Labels

bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task)
