Closed
Labels: bug, help wanted, priority: 1 (medium priority task)
Description
🐛 Bug
Setting enable_pl_optimizer=True (the default!) causes optimizers not to be restored properly from the checkpoint specified by resume_from_checkpoint.
BoringModel Colab Reproduction
The model is trained for 3 epochs and saved to a checkpoint. The checkpoint is then restored and training continues for 1 more epoch, once for each value of enable_pl_optimizer. The training loss is printed at every step.
With enable_pl_optimizer=True, the loss spikes sharply right after the first optimizer step, which suggests that the optimizer state is not restored properly.
https://colab.research.google.com/drive/1lHYXm4MpnmXwPZTcPem4D4wwwU5vJhHc?usp=sharing
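To illustrate why a loss spike right after the first optimizer step points at missing optimizer state, here is a minimal pure-Python sketch. It is not the Colab reproduction and does not use PyTorch Lightning. It assumes an Adam-style optimizer (the issue does not say which optimizer is used) and a toy quadratic loss, and all names in it (adam_step, fresh_state, checkpoint) are hypothetical. With Adam, a freshly initialised state makes the first update roughly lr in size regardless of how small the gradient is, whereas a correctly restored state keeps the update small near convergence.

```python
import math


def adam_step(w, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar weight; returns (new_w, new_state)."""
    t = state["t"] + 1
    m = beta1 * state["m"] + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * state["v"] + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                          # bias correction
    v_hat = v / (1 - beta2 ** t)
    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return new_w, {"t": t, "m": m, "v": v}


def loss(w):
    # Toy objective (assumption): a 1-D quadratic with its minimum at w = 0.
    return 0.5 * w * w


def grad(w):
    return w


def fresh_state():
    # What the optimizer looks like if its state is NOT restored.
    return {"t": 0, "m": 0.0, "v": 0.0}


# "Train" and save a checkpoint holding both the weight and the optimizer state.
w, state = 5.0, fresh_state()
for _ in range(300):
    w, state = adam_step(w, grad(w), state)
checkpoint = {"w": w, "optimizer_state": dict(state)}

# Resume A: weight and optimizer state are both restored.
restored_w, _ = adam_step(
    checkpoint["w"], grad(checkpoint["w"]), checkpoint["optimizer_state"]
)

# Resume B: only the weight is restored; the optimizer starts from scratch.
fresh_w, _ = adam_step(checkpoint["w"], grad(checkpoint["w"]), fresh_state())

checkpoint_loss = loss(checkpoint["w"])
restored_loss = loss(restored_w)
fresh_loss = loss(fresh_w)

print("loss at checkpoint:            ", checkpoint_loss)
print("after 1 step, state restored:  ", restored_loss)
print("after 1 step, state NOT restored:", fresh_loss)
```

In this sketch the resume without optimizer state takes a much larger first step than the resume with state, which is the same qualitative pattern as the spike reported above. It only shows one plausible mechanism and does not confirm the root cause in the library.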
Expected behavior
PL optimizers should be restored so that there is no huge loss spike after resuming, just as when enable_pl_optimizer=False.
Environment
See Colab.