
LR scheduler steps after saving checkpoint with iteration-based checkpointing #7637

@simran2905

Description


🐛 Bug

When checkpoints are saved using the every_n_train_steps argument of the ModelCheckpoint callback, checkpointing happens before the lr_scheduler step is called, whereas with the every_n_val_epochs argument it is the other way around.

In training_loop, on_train_batch_end (which saves the checkpoint) is called before the lr_scheduler is updated: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L500-L529

while on_evaluation_end is called after the learning rates have been updated: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L996-L1011
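One way to make this ordering visible without digging through checkpoints is a small logging callback along these lines. This is only a sketch assuming the ordering described above: SchedulerStepLogger is a made-up name, and trainer.lr_schedulers / _step_count are internals of the Lightning and PyTorch versions this report is about.

```python
import pytorch_lightning as pl


class SchedulerStepLogger(pl.Callback):
    """Hypothetical helper: print the scheduler's internal step counter at the
    two hooks where ModelCheckpoint can save, so the ordering difference shows up."""

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        # Per the training_loop link above, this hook runs before the loop
        # updates epoch-interval lr_schedulers, so a checkpoint saved here
        # captures the pre-step scheduler state.
        scheduler = trainer.lr_schedulers[0]["scheduler"]
        print(f"on_train_batch_end: _step_count={scheduler._step_count}")

    def on_validation_end(self, trainer, pl_module):
        # Per the trainer link above, this hook runs after the learning rates
        # have been updated, so a checkpoint saved here captures the
        # post-step scheduler state.
        scheduler = trainer.lr_schedulers[0]["scheduler"]
        print(f"on_validation_end: _step_count={scheduler._step_count}")
```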

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1bBkhGiKJoavp1O4Oi4kLWVnxohl5HR6M?usp=sharing

To Reproduce
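The notebook has the exact code; below is a condensed, self-contained sketch of the kind of comparison described in this report. It is not a copy of the notebook: the BoringModel/RandomDataset shapes, the StepLR scheduler, the Trainer flags, and the dirpath names are illustrative, and every_n_val_epochs assumes a Lightning version where that argument still exists.

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch).sum()

    def validation_step(self, batch, batch_idx):
        return self(batch).sum()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        # Epoch-interval scheduler: should step exactly once per training epoch.
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return {"optimizer": optimizer, "lr_scheduler": scheduler}


def run(checkpoint_callback):
    # batch_size=1 over 64 samples -> 64 training batches per epoch, matching the notebook.
    train_loader = DataLoader(RandomDataset(), batch_size=1)
    val_loader = DataLoader(RandomDataset(), batch_size=1)
    trainer = pl.Trainer(
        max_epochs=2,
        callbacks=[checkpoint_callback],
        num_sanity_val_steps=0,
        progress_bar_refresh_rate=0,
    )
    trainer.fit(BoringModel(), train_loader, val_loader)


# Checkpoint every 64 training steps (= once per epoch here) vs. every validation epoch.
run(ModelCheckpoint(dirpath="ckpt_steps", every_n_train_steps=64, save_top_k=-1))
run(ModelCheckpoint(dirpath="ckpt_epochs", every_n_val_epochs=1, save_top_k=-1))
```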

Expected behavior

In the Colab notebook, a training epoch consists of 64 batches. When I save a checkpoint with every_n_train_steps=64 or with every_n_val_epochs=1, the lr_scheduler step count stored in the checkpoint should be the same, but it is one less with every_n_train_steps=64.
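The discrepancy can be checked by loading the checkpoints each run produced and looking at the stored scheduler state. Again a sketch: the dirpath values come from the snippet above, and the "lr_schedulers" / "_step_count" keys are how the checkpoint dicts look in the Lightning and PyTorch versions assumed here.

```python
import glob
import torch

for pattern in ("ckpt_steps/*.ckpt", "ckpt_epochs/*.ckpt"):
    for path in sorted(glob.glob(pattern)):
        # Lightning stores one state_dict per configured scheduler under "lr_schedulers".
        state = torch.load(path, map_location="cpu")["lr_schedulers"][0]
        print(path, "-> _step_count:", state["_step_count"], "last_epoch:", state["last_epoch"])
```

If the ordering described above holds, the checkpoints saved via every_n_train_steps=64 report a step count one behind the ones saved via every_n_val_epochs=1.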


Additional context

cc @ananthsub
