Description
🐛 Bug
Discovered in #10995. Present only in master. Appeared after #10940.
The state in ModelCheckpoint does not get reloaded because the state_key does not match. This is a regression caused by #10940, which moved the code from ModelCheckpoint.on_init_end to ModelCheckpoint.on_pretrain_routine_start, causing the Trainer to attempt reloading the state BEFORE the state_key is fully defined. In particular, the save_on_train_epoch_end key is determined lazily.
The bug is not present in any release.
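The mismatch can be illustrated with a minimal pure-Python sketch (no Lightning dependency; the `Checkpoint` class and `setup` hook below are hypothetical stand-ins mirroring how `ModelCheckpoint.state_key` depends on the lazily resolved `save_on_train_epoch_end`):

```python
class Checkpoint:
    def __init__(self):
        # Mirrors ModelCheckpoint: the flag is unresolved at construction time.
        self.save_on_train_epoch_end = None

    @property
    def state_key(self):
        # The key embeds the lazily determined attribute.
        return f"Checkpoint{{'save_on_train_epoch_end': {self.save_on_train_epoch_end}}}"

    def setup(self):
        # Stand-in for the logic that moved to on_pretrain_routine_start,
        # which now runs AFTER the Trainer tries to restore callback state.
        self.save_on_train_epoch_end = True


# Callback state saved by a previous run, where the key was fully resolved:
saved = {"Checkpoint{'save_on_train_epoch_end': True}": {"best_score": 0.1}}

cb = Checkpoint()
# Reload attempted BEFORE setup(): the key still contains None, so no match.
state_before = saved.get(cb.state_key)

cb.setup()
# After the flag is resolved, the key matches the saved one.
state_after = saved.get(cb.state_key)

print(state_before)  # None -- the state is silently not reloaded
print(state_after)   # {'best_score': 0.1}
```

In other words, the lookup silently returns nothing because it happens before the key stabilizes; running it after the flag is resolved finds the saved state.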
To Reproduce
This can be reproduced by setting max_epochs=1 instead of max_steps=1 in tests/models/test_restore.py::test_callbacks_state_fit_ckpt_path.
Expected behavior
Callback state gets reloaded when calling Trainer.fit and ModelCheckpoint arguments are unmodified.
Environment
- PyTorch Lightning Version (e.g., 1.5.0): master (15.12.2021)
- PyTorch Version (e.g., 1.10): 1.10
- Python version (e.g., 3.9):
- OS (e.g., Linux):
- CUDA/cuDNN version:
- GPU models and configuration:
- How you installed PyTorch (conda, pip, source):
- If compiling from source, the output of torch.__config__.show():
- Any other relevant information:
Additional context
cc @tchaton @carmocca @awaelchli @ninginthecloud @jjenniferdai @ananthsub @daniellepintz @justusschock