🐛 Bug
It seems that #5244 (which went out with 1.1.4) introduced a bad interaction with auto_lr_find=True.
Specifically, lightning_optimizers are now cached on the Trainer. However, when auto_lr_find=True updates the learning rate, the optimizers returned by configure_optimizers change, so the cached lightning_optimizers need to be refreshed -- but this is no longer handled, because the optimizers are no longer re-wrapped in the general case.
The outcome for me is that training simply doesn't converge, because we end up stepping the wrong optimizer.
Please reproduce using the BoringModel
https://colab.research.google.com/drive/1PJGOBSUdl5_-U9O-fvo83V1On6_siwAC?usp=sharing
To Reproduce
See the colab linked above.
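For reference, in case the colab link goes stale, here is a minimal sketch of the kind of setup that hits this, assuming the usual BoringModel pattern with configure_optimizers reading the lr attribute that auto_lr_find updates. The names and dataset below are illustrative; the colab is the actual reproduction.

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class BoringModel(pl.LightningModule):
    def __init__(self, lr=1e-3):
        super().__init__()
        self.lr = lr  # attribute the lr finder updates
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # Re-invoked after the lr finder writes the new value into self.lr,
        # but the Trainer keeps stepping the optimizer it wrapped earlier.
        return torch.optim.SGD(self.layer.parameters(), lr=self.lr)


train_loader = DataLoader(RandomDataset(32, 64), batch_size=2)
model = BoringModel()
trainer = pl.Trainer(auto_lr_find=True, max_epochs=1)
trainer.tune(model, train_loader)  # runs the lr finder and updates model.lr
trainer.fit(model, train_loader)   # training then steps the stale optimizer
```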
Expected behavior
Training should converge, with the optimizer actually being stepped built from the learning rate found by auto_lr_find.
Environment
- CUDA:
  - GPU:
    - Tesla T4
  - available: True
  - version: 10.1
- Packages:
  - numpy: 1.19.5
  - pyTorch_debug: False
  - pyTorch_version: 1.7.1+cu101
  - pytorch-lightning: 1.2.1
  - tqdm: 4.41.1
- System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor: x86_64
  - python: 3.7.10
  - version: #1 SMP Thu Jul 23 08:00:38 PDT 2020
Additional context
- This was a pretty frustrating bug to track down: it broke training on my model in a seemingly unrelated way, and I had to literally git bisect both my repo and pytorch-lightning's repo to find it.
- It's scary to me that the bug seems to have gone unnoticed for so many versions -- does no one use auto_lr_find=True? Are there no test cases covering this combination?