@awaelchli
Contributor

@awaelchli awaelchli commented Dec 16, 2021

What does this PR do?

Fixes #11110
Fixes a combination of two bugs discovered in #10995.

  1. The test test_callbacks_state_fit_ckpt_path ran with max_steps=1 instead of max_epochs=1 and therefore never created a checkpoint. This had the effect of hiding bug 2:
  2. The state does not get reloaded because the state_key does not match. This is a regression caused by #10940 ("Deprecate callback hooks on_init_start and on_init_end"), which moved the code in ModelCheckpoint.on_init_end to ModelCheckpoint.on_pretrain_routine_start, causing the Trainer to attempt reloading the state BEFORE the state_key is fully defined. In particular, the save_on_train_epoch_end key gets determined lazily.

This PR fixes these two issues by setting max_epochs=1 in the test and by moving the implementation of ModelCheckpoint.on_pretrain_routine_start to ModelCheckpoint.setup, which runs BEFORE callback states are reloaded.
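
To illustrate why the hook ordering matters, here is a minimal, self-contained sketch of the mismatch. It does not use the real Lightning classes: `ToyCheckpoint`, `resolve`, `restore`, and `run` are hypothetical names invented for this example, and the assumption is only that saved callback state is looked up by a `state_key` string built from the callback's parameters.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToyCheckpoint:
    """Hypothetical stand-in for a callback whose state_key includes a lazily resolved field."""

    monitor: str = "val_loss"
    save_on_train_epoch_end: Optional[bool] = None  # None until resolved
    best_score: Optional[float] = None

    @property
    def state_key(self) -> str:
        # The key is derived from the current parameter values, so it changes
        # once save_on_train_epoch_end is resolved from None to a bool.
        params = {
            "monitor": self.monitor,
            "save_on_train_epoch_end": self.save_on_train_epoch_end,
        }
        return f"ToyCheckpoint{params!r}"

    def resolve(self) -> None:
        # Stand-in for the lazy resolution step (the value True is an assumption).
        if self.save_on_train_epoch_end is None:
            self.save_on_train_epoch_end = True

    def state_dict(self) -> Dict[str, Any]:
        return {"best_score": self.best_score}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.best_score = state["best_score"]


def restore(callback: ToyCheckpoint, checkpoint: Dict[str, Dict[str, Any]]) -> bool:
    """Reload state only if the callback's current state_key is found in the checkpoint."""
    state = checkpoint.get(callback.state_key)
    if state is None:
        return False
    callback.load_state_dict(state)
    return True


def run(resolve_before_restore: bool) -> Optional[float]:
    # First run: the key is fully resolved by the time the state is saved.
    saver = ToyCheckpoint()
    saver.resolve()
    saver.best_score = 0.25
    checkpoint = {saver.state_key: saver.state_dict()}

    # Resumed run: only the order of resolve() and restore() differs.
    resumed = ToyCheckpoint()
    if resolve_before_restore:
        resumed.resolve()
        restore(resumed, checkpoint)
    else:
        restore(resumed, checkpoint)
        resumed.resolve()
    return resumed.best_score


late = run(resolve_before_restore=False)  # key still contains None at restore time
early = run(resolve_before_restore=True)  # key matches the saved one
```

In this sketch, resolving after the restore attempt leaves `best_score` unset because the lookup key still contains `None`, while resolving first restores the saved value. This mirrors the ordering problem described above, under the stated assumptions.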

The latter is a hotfix, but we should discuss whether we want to reintroduce/de-deprecate on_init_end.
@ananthsub @daniellepintz @justusschock

The bug is not present in any release. Milestone 1.6.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

Part of #1 (it's a lie, this is just here to avoid noisy GitHub bot)

cc @Borda @tchaton @carmocca @awaelchli @ninginthecloud @jjenniferdai

@awaelchli awaelchli added the bug (Something isn't working) and priority: 0 (High priority task) labels Dec 16, 2021
@awaelchli awaelchli added this to the 1.6 milestone Dec 16, 2021
@awaelchli awaelchli marked this pull request as ready for review December 16, 2021 22:42
@mergify mergify bot added the ready (PRs ready to be merged) label Dec 16, 2021
@awaelchli awaelchli merged commit e19d93f into master Dec 16, 2021
@awaelchli awaelchli deleted the bugfix/modelcheckpoint-states branch December 16, 2021 23:18
@daniellepintz
Contributor

super nit from the description:

which has moved the code in ModelCheckpoint.on_init_end to ModelCheckpoint.setup

do you mean "which has moved the code in ModelCheckpoint.on_init_end to ModelCheckpoint.on_pretrain_routine_start"?

@awaelchli
Contributor Author

Yes, that was a typo in my update. It was meant to say that this PR moves the implementation of ModelCheckpoint.on_pretrain_routine_start to ModelCheckpoint.setup.

Labels

  • bug (Something isn't working)
  • callback: model checkpoint
  • priority: 0 (High priority task)
  • ready (PRs ready to be merged)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ModelCheckpoint state does not get reloaded after moving on_init_end implementation

6 participants