@carmocca
Contributor

@carmocca carmocca commented May 26, 2021

What does this PR do?

Changes:

  • Replace the call to check_checkpoint_callback in training_loop.on_train_end with the on_train_end hook implementation in ModelCheckpoint
  • Always rely on the ModelCheckpoint hooks to save.

This resolves a bug where an extra checkpoint was saved at the end of training whenever val_check_interval did not align with the training batches. In that case, a checkpoint was always saved as if save_last were set to True, even when it was not. This change is reflected in the tests.
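
To make the bug described above concrete, here is a minimal toy simulation in plain Python. It does not use Lightning: `ToyCheckpoint`, `run`, and `loop_forces_save` are hypothetical names invented for illustration, and the assumption that the `on_train_end` hook only writes a final checkpoint when `save_last` is set is a simplification, not a statement about the actual ModelCheckpoint code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyCheckpoint:
    """Hypothetical stand-in for a checkpoint callback (not Lightning's ModelCheckpoint)."""

    save_last: bool = False
    saved_steps: List[int] = field(default_factory=list)

    def save(self, step: int) -> None:
        # Skip the write if this step was already checkpointed.
        if step not in self.saved_steps:
            self.saved_steps.append(step)

    def on_validation_end(self, step: int) -> None:
        self.save(step)

    def on_train_end(self, step: int) -> None:
        # Assumption: the hook only writes a final checkpoint when save_last is set.
        if self.save_last:
            self.save(step)


def run(num_batches: int, val_every: int, ckpt: ToyCheckpoint, loop_forces_save: bool) -> List[int]:
    """Simulate one epoch; `val_every` plays the role of val_check_interval."""
    for step in range(1, num_batches + 1):
        if step % val_every == 0:
            ckpt.on_validation_end(step)
    if loop_forces_save:
        # Old behaviour (modelled): the loop saves directly, ignoring save_last.
        ckpt.save(num_batches)
    else:
        # New behaviour (modelled): the decision is left to the callback hook.
        ckpt.on_train_end(num_batches)
    return ckpt.saved_steps


before = run(10, 4, ToyCheckpoint(save_last=False), loop_forces_save=True)
after = run(10, 4, ToyCheckpoint(save_last=False), loop_forces_save=False)
print(before)  # [4, 8, 10]
print(after)   # [4, 8]
```

With 10 batches and validation every 4 batches, the last validation lands on batch 8, so the forced save at the end produces the extra checkpoint at batch 10. When the interval divides the batch count evenly (for example 8 batches), the forced save coincides with the last validation checkpoint and the problem is hidden.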

Fixes #6672

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [n/a] Did you make sure to update the documentation with your changes? (if necessary)
  • [n/a] Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • [n/a] Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@pep8speaks

pep8speaks commented May 26, 2021

Hello @carmocca! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-19 10:52:34 UTC

@carmocca carmocca marked this pull request as draft May 26, 2021 13:27
@codecov

codecov bot commented May 26, 2021

Codecov Report

Merging #7724 (6ad0a5e) into master (999ef5c) will decrease coverage by 5%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #7724    +/-   ##
=======================================
- Coverage      93%     88%    -5%     
=======================================
  Files         217     217            
  Lines       14227   14218     -9     
=======================================
- Hits        13201   12530   -671     
- Misses       1026    1688   +662     

@mergify mergify bot removed the has conflicts label Jul 14, 2021
Contributor

@tchaton tchaton left a comment

LGTM!

@awaelchli awaelchli added the checkpointing (Related to checkpointing) and ready (PRs ready to be merged) labels Jul 15, 2021
@awaelchli awaelchli enabled auto-merge (squash) July 15, 2021 17:56
@mergify mergify bot removed the has conflicts label Jul 15, 2021
@carmocca carmocca disabled auto-merge July 16, 2021 01:02
@carmocca carmocca force-pushed the refactor/remove-check-ckpt-callback branch from 84b05d1 to b709a8f Compare July 16, 2021 01:30
@mergify mergify bot removed the has conflicts label Jul 19, 2021
@carmocca carmocca enabled auto-merge (squash) July 19, 2021 10:53
@carmocca carmocca merged commit 710df39 into master Jul 19, 2021
@carmocca carmocca deleted the refactor/remove-check-ckpt-callback branch July 19, 2021 11:29
Labels

  • checkpointing: Related to checkpointing
  • priority: 0: High priority task
  • ready: PRs ready to be merged

Development

Successfully merging this pull request may close these issues.

[RFC] Training Loop Checkpoint Consolidation

8 participants