
Conversation

@carmocca
Contributor

@carmocca carmocca commented Jul 27, 2021

What does this PR do?

Fixes #5007, #4385, #11425
Part of #7406

The progress tracking state is now saved/loaded by default.

This PR uses two different epoch values (see the sketch after this list):

We have the “representation epoch”, kept for backwards compatibility

  • aka “fake epoch”
  • aka trainer.current_epoch
  • aka trainer.fit_loop.epoch_progress.current.completed
    Used for the checkpoint name and the naive epoch value saved in the checkpoint

And the “actual epoch”

  • aka trainer.fit_loop.epoch_progress.current.processed
    Used to reload on restart.
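
A minimal sketch of where these two counters end up after a short run. TinyModel, the exact Trainer flags, and the printed values are illustrative assumptions rather than code from this PR; only the attribute paths come from the list above.

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):
    # Hypothetical stand-in for any LightningModule
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        dataset = TensorDataset(torch.randn(8, 4), torch.randn(8, 1))
        return DataLoader(dataset, batch_size=4)

trainer = pl.Trainer(max_epochs=3, limit_train_batches=1, logger=False, enable_checkpointing=False)
trainer.fit(TinyModel())

epoch_progress = trainer.fit_loop.epoch_progress.current
print(epoch_progress.processed)  # "actual epoch", reloaded on restart: 3
print(epoch_progress.completed)  # "representation epoch", aka trainer.current_epoch: 3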

Does your PR introduce any breaking changes? If yes, please list them.

  • With this PR, a Trainer(max_epochs=1, limit_train_batches=1) which saves a checkpoint in on_train_epoch_end will have the values current_epoch=0, global_step=1 saved. This is because we consider on_train_epoch_end to be part of the epoch itself (see also the checkpoint sketch after this list):
for epoch in epochs:
    ...
    trainer.fit_loop.epoch_progress.current.processed += 1
    on_train_epoch_end()
    current_epoch += 1  # aka trainer.fit_loop.epoch_progress.current.completed += 1
on_train_end()
  • current_epoch is now increased by 1 in on_train_end. This means that if a model is run for 3 epochs (0, 1, 2), trainer.current_epoch will return 3. This is consistent with the fact that a new trainer returns trainer.current_epoch == 0, meaning the 0-th (first) epoch still needs to run. This is breaking for anybody who accesses the trainer.current_epoch attribute after fit has finished.
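
A rough sketch of the first bullet, reusing the hypothetical TinyModel from the sketch further up; the stored epoch/global_step values follow this description rather than verified output.

import torch
import pytorch_lightning as pl

class SaveOnEpochEnd(pl.Callback):
    # Writes a checkpoint while still inside on_train_epoch_end
    def on_train_epoch_end(self, trainer, pl_module):
        trainer.save_checkpoint("epoch_end.ckpt")

trainer = pl.Trainer(max_epochs=1, limit_train_batches=1, callbacks=[SaveOnEpochEnd()],
                     logger=False, enable_checkpointing=False)
trainer.fit(TinyModel())
trainer.save_checkpoint("after_fit.ckpt")

during = torch.load("epoch_end.ckpt")
after = torch.load("after_fit.ckpt")
print(during["epoch"], during["global_step"])  # 0 1, completed not yet incremented
print(after["epoch"], after["global_step"])    # 1 1, completed already incremented after the epoch finished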

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

🤪
this was painful

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @Borda @carmocca @justusschock

@carmocca carmocca added bug Something isn't working breaking change Includes a breaking change labels Jul 27, 2021
@carmocca carmocca added this to the v1.5 milestone Jul 27, 2021
@carmocca carmocca self-assigned this Jul 27, 2021
@codecov

codecov bot commented Jul 27, 2021

Codecov Report

Merging #8578 (c26988e) into master (7914e49) will increase coverage by 0%.
The diff coverage is 0%.

@@          Coverage Diff           @@
##           master   #8578   +/-   ##
======================================
  Coverage      45%     45%           
======================================
  Files         218     218           
  Lines       14396   14393    -3     
======================================
  Hits         6450    6450           
+ Misses       7946    7943    -3     

@carmocca carmocca mentioned this pull request Jul 28, 2021
@awaelchli awaelchli modified the milestones: v1.5, v1.6 Nov 1, 2021
@carmocca carmocca added checkpointing Related to checkpointing loops Related to the Loop API labels Nov 26, 2021
@carmocca carmocca mentioned this pull request Nov 27, 2021
@carmocca carmocca changed the base branch from master to feature/tuner-ckpt-connector January 17, 2022 23:33
@carmocca carmocca changed the title from "[WIP] Fix current_epoch value on training end" to "Fix current_epoch value on training end" Jan 17, 2022
@carmocca carmocca mentioned this pull request Jan 20, 2022
Base automatically changed from feature/tuner-ckpt-connector to master January 20, 2022 18:54
carmocca added a commit that referenced this pull request Jan 21, 2022
@carmocca carmocca mentioned this pull request Jan 21, 2022
@carmocca carmocca changed the base branch from master to refactor/preparation-8578 January 21, 2022 03:14
Contributor

@awaelchli awaelchli left a comment


I think one of the most important things to remember from this PR is that on_train_epoch_end checkpoints are different from regular checkpoints, since they have a different "fake epoch" (current_epoch) value saved.

@mergify mergify bot added the ready PRs ready to be merged label Feb 8, 2022
@rohitgr7
Contributor

rohitgr7 commented Feb 8, 2022

Just one quick question:

Checkpoints saved during training, say during the last epoch inside on_validation_end or on_train_epoch_end: will the epoch values saved and reloaded be different from what would be saved if trainer.save_checkpoint were called after training?

@carmocca
Contributor Author

carmocca commented Feb 8, 2022

Correct. There are several tests covering both cases.

@rohitgr7
Contributor

rohitgr7 commented Feb 8, 2022

Correct. There are several tests covering both cases.

Thanks for clarifying. Yeah, I saw them. Just wanted a final confirmation to keep this change in mind.

But global_step will be the same in both cases, right?

@carmocca
Contributor Author

carmocca commented Feb 8, 2022

Yes.

I'm looking at global step in #11805 (don't look inside yet!)

@tchaton
Contributor

tchaton commented Feb 9, 2022

@carmocca TODO: Remove checkpointing fault tolerant states.

@mergify mergify bot added the has conflicts label Feb 9, 2022
Contributor

@tchaton tchaton left a comment


LGTM!

@carmocca carmocca merged commit 789fae8 into master Feb 10, 2022
@carmocca carmocca deleted the bugfix/5007 branch February 10, 2022 16:56
carmocca added a commit that referenced this pull request Feb 10, 2022
rohitgr7 pushed a commit that referenced this pull request Feb 17, 2022
@carmocca carmocca mentioned this pull request Feb 25, 2022
facebook-github-bot pushed a commit to facebookresearch/mmf that referenced this pull request Feb 28, 2022
…ining end (#8578)

Summary:
### New commit log messages
- [789fae828 Fix `current_epoch` value on training end (#8578)](Lightning-AI/pytorch-lightning#8578)

Reviewed By: tangbinh

Differential Revision: D34398730

fbshipit-source-id: 2731e46ebbf3b5e62d9266fba5933b4a43eca4e9

Labels

  • breaking change (Includes a breaking change)
  • bug (Something isn't working)
  • checkpointing (Related to checkpointing)
  • fault tolerance
  • loops (Related to the Loop API)
  • ready (PRs ready to be merged)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Trainer.fit() multiple times with max_steps
  • Duplicate epochs when calling .fit() twice
  • Missing cleanup after trainer.fit() and trainer.test()

5 participants