Initialize ModelCheckpoint state as early as possible #11108

awaelchli · 2021-12-16T16:25:26Z

What does this PR do?

Fixes #11110
Fixes a combination of two bugs discovered in #10995.

1. The test test_callbacks_state_fit_ckpt_path ran with max_steps=1 instead of max_epochs=1, so it never created a checkpoint. This had the effect of hiding bug 2.
2. The state does not get reloaded because the state_key does not match. This is a regression caused by "Deprecate callback hooks on_init_start and on_init_end" (#10940), which moved the code in ModelCheckpoint.on_init_end to ModelCheckpoint.on_pretrain_routine_start, causing the Trainer to attempt to reload the state BEFORE the state_key is fully defined. In particular, the save_on_train_epoch_end key gets determined lazily.

This PR fixes these two issues by setting max_epochs=1 in the test and by moving the implementation of ModelCheckpoint.on_pretrain_routine_start to ModelCheckpoint.setup, which runs BEFORE callback states are reloaded.
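
To make the ordering problem concrete, here is a minimal, framework-free sketch. It is a toy model and not the actual Lightning code: ToyCheckpointCallback, resolve_lazy_defaults, restore and resume are hypothetical names, and the way the key is built is an assumption for illustration. It only shows that a state lookup keyed on a lazily resolved attribute fails if the lookup happens before that attribute is resolved.

```python
from typing import Any, Dict, Optional


class ToyCheckpointCallback:
    """Toy stand-in for a checkpoint callback (hypothetical, not ModelCheckpoint)."""

    def __init__(self, monitor: str, save_on_train_epoch_end: Optional[bool] = None):
        self.monitor = monitor
        # None means "decide later"; this mirrors a lazily determined setting.
        self._save_on_train_epoch_end = save_on_train_epoch_end
        self.best_score: Optional[float] = None

    def resolve_lazy_defaults(self) -> None:
        # Assumption: the default can only be decided once training is set up.
        if self._save_on_train_epoch_end is None:
            self._save_on_train_epoch_end = True

    @property
    def state_key(self) -> str:
        # The key depends on the lazily resolved attribute.
        return (
            f"ToyCheckpointCallback{{'monitor': {self.monitor!r}, "
            f"'save_on_train_epoch_end': {self._save_on_train_epoch_end!r}}}"
        )


def restore(callback: ToyCheckpointCallback, saved: Dict[str, Dict[str, Any]]) -> bool:
    """Reload the callback state if an entry with a matching key exists."""
    state = saved.get(callback.state_key)
    if state is None:
        return False
    callback.best_score = state["best_score"]
    return True


def resume(callback: ToyCheckpointCallback, saved: Dict[str, Dict[str, Any]],
           resolve_before_restore: bool) -> bool:
    """Toy resume: the flag picks whether defaults are resolved before or after the restore."""
    if resolve_before_restore:
        callback.resolve_lazy_defaults()  # analogous to resolving in an early hook
    restored = restore(callback, saved)
    if not resolve_before_restore:
        callback.resolve_lazy_defaults()  # analogous to resolving in a late hook
    return restored


# First run: the key is fully defined by the time the state is saved.
first = ToyCheckpointCallback(monitor="val_loss")
first.resolve_lazy_defaults()
first.best_score = 0.25
saved = {first.state_key: {"best_score": first.best_score}}

# Resumed runs: same constructor arguments, different hook ordering.
late = ToyCheckpointCallback(monitor="val_loss")
early = ToyCheckpointCallback(monitor="val_loss")
late_restored = resume(late, saved, resolve_before_restore=False)
early_restored = resume(early, saved, resolve_before_restore=True)

print(late_restored, late.best_score)
print(early_restored, early.best_score)
```

In this toy setup the "late" callback silently keeps its fresh state because its key still contains the unresolved value at lookup time, while the "early" callback finds its saved entry.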

The latter is a hotfix, but we should discuss whether we want to reintroduce/de-deprecate on_init_end.
@ananthsub @daniellepintz @justusschock

The bug is not present in any release. Milestone 1.6.

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

Part of #1 (it's a lie, this is just here to avoid noisy GitHub bot)

cc @Borda @tchaton @carmocca @awaelchli @ninginthecloud @jjenniferdai

pytorch_lightning/callbacks/model_checkpoint.py

daniellepintz · 2021-12-17T17:27:38Z

super nit from the description:

which has moved the code in ModelCheckpoint.on_init_end to ModelCheckpoint.setup

do you mean "which has moved the code in ModelCheckpoint.on_init_end to ModelCheckpoint.on_pretrain_routine_start"?

awaelchli · 2021-12-17T17:35:15Z

Yes, typo in update. Was meant to say that this PR moves ModelCheckpoint.on_pretrain_routine_start to ModelCheckpoint.setup

Initialize ModelCheckpoint state as early as possible

084eeaf

awaelchli added the bug and priority: 0 labels Dec 16, 2021

awaelchli added this to the 1.6 milestone Dec 16, 2021

awaelchli added the callback: model checkpoint label Dec 16, 2021

fix

5777d66

awaelchli mentioned this pull request Dec 16, 2021

Add required states for resumed ModelCheckpoint GC #10995

Merged

12 tasks

justusschock approved these changes Dec 16, 2021

ananthsub reviewed Dec 16, 2021

pytorch_lightning/callbacks/model_checkpoint.py (outdated, resolved)

awaelchli added 3 commits December 16, 2021 23:39

move to setup hook

28f5cae

Merge branch 'master' into bugfix/modelcheckpoint-states

5ab4565

clean up debug statements

4c35565

awaelchli marked this pull request as ready for review December 16, 2021 22:42

awaelchli requested review from Borda, SeanNaren, carmocca, kaushikb11, rohitgr7, tchaton and williamFalcon as code owners December 16, 2021 22:42

ananthsub approved these changes Dec 16, 2021

mergify bot added the ready label Dec 16, 2021

rohitgr7 approved these changes Dec 16, 2021

carmocca approved these changes Dec 16, 2021

awaelchli merged commit e19d93f into master Dec 16, 2021

awaelchli deleted the bugfix/modelcheckpoint-states branch December 16, 2021 23:18
