Fix current_epoch value on training end #8578
Conversation
Codecov Report
@@           Coverage Diff           @@
##           master   #8578    +/-  ##
=======================================
  Coverage      45%     45%
=======================================
  Files         218     218
  Lines       14396   14393     -3
=======================================
  Hits         6450    6450
+ Misses       7946    7943     -3
Force-pushed from 7d24d4a to 1fc8ae7
awaelchli
left a comment
I think one of the most important things to remember from this PR is that checkpoints saved in `on_train_epoch_end` differ from regular checkpoints: they store a different "fake epoch" (`current_epoch`) value.
Just one quick question: checkpoints saved during checkpointing... let's say during the last epoch, be it inside …
Correct. There are several tests covering both cases.
Thanks for clarifying... yeah, I saw. Just wanted a final confirmation to keep this change in mind. But `global_step` will be the same in both cases, right?
Yes. I'm looking at global step in #11805 (don't look inside yet!)
@carmocca TODO: Remove checkpointing fault tolerant states. |
Co-authored-by: Adrian Wälchli <[email protected]>
Force-pushed from 0e235fe to 7afb814
tchaton
left a comment
LGTM!
Co-authored-by: Adrian Wälchli <[email protected]>
…ining end (#8578)

Summary:

### New commit log messages

- [789fae828 Fix `current_epoch` value on training end (#8578)](Lightning-AI/pytorch-lightning#8578)

Reviewed By: tangbinh
Differential Revision: D34398730
fbshipit-source-id: 2731e46ebbf3b5e62d9266fba5933b4a43eca4e9
What does this PR do?
Fixes #5007, #4385, #11425
Part of #7406
The progress tracking state is now saved/loaded by default.
This PR uses two different epoch values:
- The "representation epoch", kept for backwards compatibility: `trainer.current_epoch`, i.e. `trainer.fit_loop.epoch_progress.current.completed`. Used for the checkpoint name and the naive epoch value saved in the checkpoint.
- The "actual epoch": `trainer.fit_loop.epoch_progress.current.processed`. Used to reload on restart.
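As a rough illustration of how these two counters can diverge, here is a minimal sketch in plain Python. This is not the actual Lightning `Progress` class; the field names merely mirror `epoch_progress.current`, and the whole structure is a simplified assumption.

```python
from dataclasses import dataclass


@dataclass
class EpochProgress:
    # Simplified stand-in for the tracker behind
    # `trainer.fit_loop.epoch_progress.current` (illustrative, not the real API).
    ready: int = 0      # epoch about to start
    started: int = 0    # epoch loop entered
    processed: int = 0  # all batches consumed ("actual epoch", used on restart)
    completed: int = 0  # epoch fully finished ("representation epoch")


def run_one_epoch(progress: EpochProgress) -> None:
    progress.ready += 1
    progress.started += 1
    # ... train all batches here ...
    progress.processed += 1
    # A checkpoint written in `on_train_epoch_end` would be saved at this
    # point: `processed` is already advanced, but `completed` is not yet.
    progress.completed += 1


progress = EpochProgress()
run_one_epoch(progress)
assert progress.processed == 1 and progress.completed == 1
```

The gap between `processed` and `completed` inside `on_train_epoch_end` is exactly why checkpoints saved there carry the "fake epoch" value discussed in the review comments above.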
Does your PR introduce any breaking changes? If yes, please list them.
- `Trainer(max_epochs=1, limit_train_batches=1)`, which saves a checkpoint in `on_train_epoch_end`, will have the `current_epoch=0, global_step=1` values saved. This is because we consider `on_train_epoch_end` to be part of the epoch itself.
- `current_epoch` is now increased by 1 in `on_train_end`. This means that if a model is run for 3 epochs (0, 1, 2), `trainer.current_epoch` will return 3. This is consistent with the fact that a new trainer returns `trainer.current_epoch == 0`, as in, the 0-th (first) epoch needs to run. This will be breaking for anybody accessing the `trainer.current_epoch` attribute after `fit` is finished.
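The new `current_epoch` semantics can be sketched with a toy loop. `ToyTrainer` and its attributes are hypothetical illustrations, not Lightning itself; only the counting behavior is modeled.

```python
class ToyTrainer:
    """Toy model of the new `current_epoch` semantics (illustrative only)."""

    def __init__(self, max_epochs: int) -> None:
        self.max_epochs = max_epochs
        self.current_epoch = 0  # 0 == the first (0-th) epoch has not run yet

    def fit(self) -> None:
        while self.current_epoch < self.max_epochs:
            # ... run one full training epoch here ...
            self.current_epoch += 1  # bumped when the epoch (and training) ends


trainer = ToyTrainer(max_epochs=3)
assert trainer.current_epoch == 0   # fresh trainer: 0-th epoch still to run
trainer.fit()                       # runs epochs 0, 1, 2
assert trainer.current_epoch == 3   # after fit: count of completed epochs
```

Under the old behavior, the attribute would have read 2 after running 3 epochs; code that relied on that should subtract 1 or switch to the progress-tracking counters.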
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
🤪
this was painful
cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @Borda @carmocca @justusschock