🐛 Bug
When checkpoints are saved via the every_n_train_steps argument of the ModelCheckpoint callback, checkpointing happens before the lr_scheduler step is called, while it is the other way around with the every_n_val_epochs argument.
In training_loop.py, on_train_batch_end (which saves the checkpoint) is called before the lr_schedulers are updated: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L500-L529
while on_evaluation_end is called after the lr_schedulers are updated: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L996-L1011
A minimal sketch of this setup is included under "To Reproduce" below.
Please reproduce using the BoringModel
https://colab.research.google.com/drive/1bBkhGiKJoavp1O4Oi4kLWVnxohl5HR6M?usp=sharing
To Reproduce
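The actual reproduction is in the colab notebook linked above. Below is only a minimal sketch of the kind of setup in which the two arguments diverge, assuming a pytorch-lightning version (~1.3) where ModelCheckpoint exposes both every_n_train_steps and every_n_val_epochs; the toy module, dataset size, and checkpoint directories are illustrative assumptions, not the notebook's code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class ToyModule(pl.LightningModule):
    """Stand-in for the BoringModel used in the colab (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self(x), y))

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        # Epoch-interval scheduler: it steps once at the end of each training
        # epoch, which is exactly where the callback ordering matters.
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [scheduler]


def make_loader():
    dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 2))
    return DataLoader(dataset, batch_size=1)  # 64 batches per epoch


# Two checkpoints that should fire at the same point in training, but whose
# saved lr_scheduler state differs by one step because of the ordering above.
step_ckpt = ModelCheckpoint(dirpath="ckpt_step", every_n_train_steps=64)
epoch_ckpt = ModelCheckpoint(dirpath="ckpt_epoch", every_n_val_epochs=1)

trainer = pl.Trainer(max_epochs=1, callbacks=[step_ckpt, epoch_ckpt])
trainer.fit(ToyModule(), make_loader(), make_loader())
```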
Expected behavior
In the colab notebook, a training epoch consists of 64 batches. When I save a checkpoint using every_n_train_steps=64 or with every_n_val_epochs=1, the lr_scheduler step count should be the same in both checkpoints, but it is 1 less with every_n_train_steps=64.
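One way to make the mismatch visible is to compare the scheduler state stored in the two checkpoints. A sketch, assuming the illustrative checkpoint directories from the snippet above and that the checkpoints keep the scheduler state dicts under the lr_schedulers key, as Lightning checkpoints normally do:

```python
import glob
import torch

# Compare the lr_scheduler state captured by the two checkpoints from the
# sketch above (directories are the illustrative ones used there).
step_path = glob.glob("ckpt_step/*.ckpt")[0]
epoch_path = glob.glob("ckpt_epoch/*.ckpt")[0]

step_state = torch.load(step_path, map_location="cpu")["lr_schedulers"][0]
epoch_state = torch.load(epoch_path, map_location="cpu")["lr_schedulers"][0]

# With the bug described above, the step-triggered checkpoint is one
# scheduler step behind the epoch-triggered one.
print(step_state["_step_count"], epoch_state["_step_count"])
```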
Environment
- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux):
- How you installed PyTorch (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
Additional context
cc @ananthsub