
Conversation

@carmocca
Contributor

@carmocca carmocca commented Jul 27, 2021

What does this PR do?

Fixes #5007, #4385, #11425
Part of #7406

The progress tracking state is now saved/loaded by default.

This PR uses two different epoch values (see the sketch after this list):

We have the “representation epoch”, kept for backwards compatibility

  • aka “fake epoch”
  • aka trainer.current_epoch
  • aka trainer.fit_loop.epoch_progress.current.completed
    Used for the checkpoint name and the naive epoch value saved in the checkpoint

And the “actual epoch”

  • aka trainer.fit_loop.epoch_progress.current.processed
    Used to reload on restart.
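
A minimal sketch of where these two counters end up after a short run. TinyModel, the exact Trainer flags, and the printed values are illustrative assumptions rather than code from this PR; only the attribute paths come from the list above.

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):
    # Hypothetical stand-in for any LightningModule
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        dataset = TensorDataset(torch.randn(8, 4), torch.randn(8, 1))
        return DataLoader(dataset, batch_size=4)

trainer = pl.Trainer(max_epochs=3, limit_train_batches=1, logger=False, enable_checkpointing=False)
trainer.fit(TinyModel())

epoch_progress = trainer.fit_loop.epoch_progress.current
print(epoch_progress.processed)  # "actual epoch", reloaded on restart: 3
print(epoch_progress.completed)  # "representation epoch", aka trainer.current_epoch: 3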

Does your PR introduce any breaking changes? If yes, please list them.

  • With this PR, a Trainer(max_epochs=1, limit_train_batches=1) which saves a checkpoint in on_train_epoch_end will have the values current_epoch=0, global_step=1 saved. This is because we consider on_train_epoch_end to be part of the epoch itself (see also the checkpoint sketch after this list):
for epoch in epochs:
    ...
    trainer.fit_loop.epoch_progress.current.processed += 1
    on_train_epoch_end()
    current_epoch += 1  # aka trainer.fit_loop.epoch_progress.current.completed += 1
on_train_end()
  • current_epoch is now increased by 1 in on_train_end. This means that if a model is run for 3 epochs (0, 1, 2), trainer.current_epoch will return 3. This is consistent with the fact that a new trainer returns trainer.current_epoch == 0, meaning the 0-th (first) epoch still needs to run. This is breaking for anybody who accesses the trainer.current_epoch attribute after fit has finished.
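
A rough sketch of the first bullet, reusing the hypothetical TinyModel from the sketch further up; the stored epoch/global_step values follow this description rather than verified output.

import torch
import pytorch_lightning as pl

class SaveOnEpochEnd(pl.Callback):
    # Writes a checkpoint while still inside on_train_epoch_end
    def on_train_epoch_end(self, trainer, pl_module):
        trainer.save_checkpoint("epoch_end.ckpt")

trainer = pl.Trainer(max_epochs=1, limit_train_batches=1, callbacks=[SaveOnEpochEnd()],
                     logger=False, enable_checkpointing=False)
trainer.fit(TinyModel())
trainer.save_checkpoint("after_fit.ckpt")

during = torch.load("epoch_end.ckpt")
after = torch.load("after_fit.ckpt")
print(during["epoch"], during["global_step"])  # 0 1, completed not yet incremented
print(after["epoch"], after["global_step"])    # 1 1, completed already incremented after the epoch finished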

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

🤪
this was painful

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @Borda @carmocca @justusschock

@carmocca carmocca added bug Something isn't working breaking change Includes a breaking change labels Jul 27, 2021
@carmocca carmocca added this to the v1.5 milestone Jul 27, 2021
@carmocca carmocca self-assigned this Jul 27, 2021
@codecov

codecov bot commented Jul 27, 2021

Codecov Report

Merging #8578 (c26988e) into master (7914e49) will increase coverage by 0%.
The diff coverage is 0%.

@@          Coverage Diff           @@
##           master   #8578   +/-   ##
======================================
  Coverage      45%     45%           
======================================
  Files         218     218           
  Lines       14396   14393    -3     
======================================
  Hits         6450    6450           
+ Misses       7946    7943    -3     

@carmocca carmocca mentioned this pull request Jul 28, 2021
@awaelchli awaelchli modified the milestones: v1.5, v1.6 Nov 1, 2021
@carmocca carmocca added checkpointing Related to checkpointing loops Related to the Loop API labels Nov 26, 2021
@carmocca carmocca mentioned this pull request Nov 27, 2021
@carmocca carmocca changed the base branch from master to feature/tuner-ckpt-connector January 17, 2022 23:33
@carmocca carmocca changed the title from "[WIP] Fix current_epoch value on training end" to "Fix current_epoch value on training end" Jan 17, 2022
@carmocca carmocca mentioned this pull request Jan 20, 2022
Base automatically changed from feature/tuner-ckpt-connector to master January 20, 2022 18:54
carmocca added a commit that referenced this pull request Jan 21, 2022
@carmocca carmocca mentioned this pull request Jan 21, 2022
@carmocca carmocca changed the base branch from master to refactor/preparation-8578 January 21, 2022 03:14
Contributor

@awaelchli awaelchli left a comment


I think one of the most important things to remember from this PR is that on_train_epoch_end checkpoints are different from regular checkpoints, since they have a different "fake epoch" (current_epoch) value saved.

@mergify mergify bot added the ready PRs ready to be merged label Feb 8, 2022
@rohitgr7
Contributor

rohitgr7 commented Feb 8, 2022

Just one quick question:

Checkpoints saved during training, say during the last epoch inside on_validation_end or on_train_epoch_end: will the epoch values saved and reloaded be different from what would be saved if trainer.save_checkpoint were called after training?

@carmocca
Contributor Author

carmocca commented Feb 8, 2022

Correct. There are several tests covering both cases.

@rohitgr7
Contributor

rohitgr7 commented Feb 8, 2022

Correct. There are several tests covering both cases.

Thanks for clarifying. Yeah, I saw them. Just wanted a final confirmation to keep this change in mind.

But global_step will be the same in both cases, right?

@carmocca
Contributor Author

carmocca commented Feb 8, 2022

Yes.

I'm looking at global step in #11805 (don't look inside yet!)

@tchaton
Contributor

tchaton commented Feb 9, 2022

@carmocca TODO: Remove checkpointing fault tolerant states.

@mergify mergify bot added the has conflicts label Feb 9, 2022
Contributor

@tchaton tchaton left a comment


LGTM!

@carmocca carmocca merged commit 789fae8 into master Feb 10, 2022
@carmocca carmocca deleted the bugfix/5007 branch February 10, 2022 16:56
carmocca added a commit that referenced this pull request Feb 10, 2022
rohitgr7 pushed a commit that referenced this pull request Feb 17, 2022
@carmocca carmocca mentioned this pull request Feb 25, 2022
facebook-github-bot pushed a commit to facebookresearch/mmf that referenced this pull request Feb 28, 2022
…ining end (#8578)

Summary:
### New commit log messages
- [789fae828 Fix `current_epoch` value on training end (#8578)](Lightning-AI/pytorch-lightning#8578)

Reviewed By: tangbinh

Differential Revision: D34398730

fbshipit-source-id: 2731e46ebbf3b5e62d9266fba5933b4a43eca4e9

Labels

  • breaking change (Includes a breaking change)
  • bug (Something isn't working)
  • checkpointing (Related to checkpointing)
  • fault tolerance
  • loops (Related to the Loop API)
  • ready (PRs ready to be merged)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Trainer.fit() multiple times with max_steps
  • Duplicate epochs when calling .fit() twice
  • Missing cleanup after trainer.fit() and trainer.test()

5 participants