
ModelCheckpoint state does not get reloaded after moving on_init_end implementation #11110

@awaelchli

Description

🐛 Bug

Discovered in #10995. Present only in master. Appeared after #10940.

The state in ModelCheckpoint does not get reloaded because the state_key does not match. This is a regression caused by #10940, which moved the code from ModelCheckpoint.on_init_end to ModelCheckpoint.on_pretrain_routine_start, causing the Trainer to attempt to reload the state BEFORE the state_key is fully defined. In particular, the save_on_train_epoch_end component of the key is determined lazily.
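To illustrate the failure mode, here is a minimal standalone sketch (not Lightning's actual implementation; the class and attribute names merely mirror the ones in the report) of how a state_key that includes a lazily resolved attribute can fail to match the key stored in a checkpoint:

```python
# Sketch: a callback whose state_key depends on an attribute that is only
# resolved later in the setup sequence. Keys computed before and after the
# lazy resolution differ, so a dict lookup by key fails.

class LazyKeyCallback:
    def __init__(self, save_on_train_epoch_end=None):
        # not yet resolved at construction time
        self.save_on_train_epoch_end = save_on_train_epoch_end

    @property
    def state_key(self):
        # the key embeds the (possibly unresolved) attribute value
        return f"ModelCheckpoint{{'save_on_train_epoch_end': {self.save_on_train_epoch_end}}}"

    def resolve_lazily(self):
        # stand-in for the logic that used to run earlier (on_init_end)
        # and now runs later (on_pretrain_routine_start)
        self.save_on_train_epoch_end = True


cb = LazyKeyCallback()
key_before = cb.state_key   # attribute still None -> "...None}"
cb.resolve_lazily()
key_after = cb.state_key    # attribute now True -> "...True}"

# The checkpoint was written with the fully resolved key, but the Trainer
# performs the lookup before resolution, so nothing is found.
saved_states = {key_after: {"best_model_score": 0.1}}
print(key_before == key_after)     # False
print(key_before in saved_states)  # False -> state is silently not reloaded
```

If the Trainer attempts the lookup only after the key is fully defined (or the key is made independent of lazily resolved values), both sides compute the same key and the state is found.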

The bug is not present in any release.

To Reproduce

This can be reproduced by setting max_epochs=1 instead of max_steps=1 in tests/models/test_restore.py::test_callbacks_state_fit_ckpt_path.

Expected behavior

Callback state gets reloaded when Trainer.fit is called and the ModelCheckpoint arguments are unmodified.

Environment


  • PyTorch Lightning Version (e.g., 1.5.0): master (15.12.2021)
  • PyTorch Version (e.g., 1.10): 1.10

Additional context

cc @tchaton @carmocca @awaelchli @ninginthecloud @jjenniferdai @ananthsub @daniellepintz @justusschock
