
Conversation

@rohitgr7
Contributor

@rohitgr7 rohitgr7 commented Dec 20, 2020

What does this PR do?

Fixes #4603
Fixes #4655
Fixes #4797
Fixes #5156

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified; bugfixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks

pep8speaks commented Dec 20, 2020

Hello @rohitgr7! Thanks for updating this PR.

Line 234:13: W503 line break before binary operator

Line 592:17: W503 line break before binary operator
Line 593:17: W503 line break before binary operator
Line 895:13: W503 line break before binary operator
Line 896:13: W503 line break before binary operator
Line 899:13: W503 line break before binary operator

Comment last updated at 2021-02-08 08:00:12 UTC

@codecov

codecov bot commented Dec 20, 2020

Codecov Report

Merging #5208 (b0608b0) into master (3b7afb9) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #5208    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         134     134            
  Lines       10053   10051     -2     
=======================================
- Hits         9399    8980   -419     
- Misses        654    1071   +417     

@rohitgr7 rohitgr7 changed the title Separate epoch validation from step validation [skip ci] Separate epoch validation from step validation Jan 25, 2021
@rohitgr7 rohitgr7 added the bug, callback, and checkpointing labels and removed the has conflicts label Jan 25, 2021
Contributor

@carmocca carmocca left a comment


Minor comments. Overall awesome!

@mergify mergify bot added the has conflicts label Feb 3, 2021
@mergify mergify bot removed the has conflicts label Feb 5, 2021
Contributor

@tchaton tchaton left a comment


Great PR! Small question.

# reset stage to train
self.trainer.logger_connector.set_stage("train")

should_skip_eval = self.trainer.evaluation_loop.should_skip_evaluation(self.trainer.num_val_batches)
Contributor


Slightly confused about this part. Can you explain why we check val and then decide whether we should skip it?

Contributor Author


It's just to check whether any validation batches are available. If there aren't, we should run the train_only_check; otherwise we should not. There are two cases with no validation: one where there is no validation_step, and another where there is a validation_step but no validation batches. Previously, when a validation_step was defined but there were no validation batches, the train_only_check was skipped, even though ideally it should not be.
Resolves issue #4603.
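
For illustration, the intent of that call can be sketched like this (a hedged sketch assuming num_val_batches holds one batch count per validation dataloader; this is not the exact Lightning implementation):

def should_skip_evaluation(num_val_batches):
    # num_val_batches is assumed to be [] when no val_dataloader is defined,
    # or e.g. [0] when limit_val_batches=0; either way there is nothing to
    # validate, so the train-only path should run.
    return sum(num_val_batches) == 0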

Contributor


@rohitgr7 I'm debugging an issue right now related to this.

pytorch_lightning/trainer/training_loop.py:753: Input params: batch_idx=1, is_last_batch=True, on_epoch=True
pytorch_lightning/trainer/training_loop.py:755: batch_idx+1=2, trainer.val_check_batch=2, is_val_check_batch=True
pytorch_lightning/trainer/training_loop.py:758: current_epoch+1=1, trainer.check_val_every_n_epoch=1, is_val_check_epoch=True
pytorch_lightning/trainer/training_loop.py:761: enable_validation=True, is_val_check_epoch=True, can_check_val=True
pytorch_lightning/trainer/training_loop.py:765: is_last_batch=True, trainer.val_check_batch=2, is_last_batch_for_infinite_dataset=False
pytorch_lightning/trainer/training_loop.py:768: batch_idx + 1=2, trainer.num_training_batches=2, epoch_end_val_check=True
pytorch_lightning/trainer/training_loop.py:774: is_val_check_batch=True, is_val_check_epoch=True, can_check_val=True, is_last_batch_for_infinite_dataset=False, epoch_end_val_check=True, should_check_val=True
pytorch_lightning/trainer/training_loop.py:775: should_check_val=True, can_check_val=True
pytorch_lightning/trainer/training_loop.py:487: should_check_val=True
pytorch_lightning/trainer/training_loop.py:489: should_skip_eval=True, trainer.num_val_batches=[]

This check for should_skip_eval is forcing should_train_only to be True, which causes the checkpoint callback to run before validation. The checkpoint is configured to monitor a metric that appears only in validation, which leads to a failure. I don't get why should_skip_eval affects should_train_only; shouldn't that be decided entirely by self.trainer.disable_validation?

This could also be pointing to a bug in how self.trainer.num_val_batches is set.
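
For context, the interaction being described can be sketched roughly as follows (a simplified, assumed reconstruction of the decision shown in the log above, not verbatim trainer code):

num_val_batches = []                            # empty: no validation batches were recorded
should_skip_eval = sum(num_val_batches) == 0    # sum([]) == 0, so this is True
disable_validation = False                      # a validation_step is defined
should_train_only = disable_validation or should_skip_eval
# should_train_only ends up True, so checkpointing runs before validation and a
# monitor that is only logged during validation cannot be found.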

Borda
Borda previously requested changes Feb 5, 2021
Comment on lines +89 to +90
call.on_epoch_end(trainer, model),
call.on_train_epoch_end(trainer, model, ANY),
Collaborator


I think @williamFalcon made a point some time ago that training should run up until validation, and the example was running validation multiple times over a long training epoch...
cc: @tchaton @PyTorchLightning/core-contributors

Contributor Author

@rohitgr7 rohitgr7 Feb 5, 2021


Yes, it still works like that, but only if val_check_interval < 1.0, or if it is an int where val_check_interval % num_training_batches != 0. If it is set to 1.0, validation here happens after the training epoch, because we create checkpoints in on_validation_end and epoch-level learning rates are updated once training is done: with ReduceLROnPlateau we need the monitored metrics, and when the monitor is training-specific they are only available after the full training epoch has completed.
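
As a rough usage sketch of the two behaviours described above (plain Trainer arguments; nothing here is specific to this PR's internals):

from pytorch_lightning import Trainer

# val_check_interval < 1.0: validation still runs mid-epoch, here twice per epoch.
trainer = Trainer(val_check_interval=0.5)

# val_check_interval = 1.0 (default): after this change, validation runs once the
# training epoch has finished, so on_validation_end checkpoints and epoch-level
# schedulers such as ReduceLROnPlateau see the monitored metric first.
trainer = Trainer(val_check_interval=1.0)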

@Borda Borda added the ready (PRs ready to be merged) label Feb 5, 2021
@rohitgr7 rohitgr7 requested review from Borda and kaushikb11 February 7, 2021 09:21
@Borda Borda enabled auto-merge (squash) February 8, 2021 07:51
@Borda Borda self-requested a review February 8, 2021 08:00
@Borda Borda merged commit e429f97 into master Feb 8, 2021
@Borda Borda deleted the bugfix/ep_end_ckpt branch February 8, 2021 08:35
Borda pushed a commit that referenced this pull request Feb 8, 2021
* Seperate epoch validaton from step validation

* update system

* test

* baked logic in callbacks

* unbake logic in callbacks

* fix the call for scheduler

* use property

* pep

* correct rebase

* gitignore

* ref

* add tests

* fix

* add early stopping test

* trigger

* chlog

* rev

* 1.3

* log

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <[email protected]>

* Update pytorch_lightning/trainer/training_loop.py

* Update CHANGELOG.md

* Apply suggestions from code review

Co-authored-by: chaton <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <[email protected]>

(cherry picked from commit e429f97)
Borda pushed a second commit that referenced this pull request Feb 8, 2021 (same commit message as above; cherry picked from commit e429f97)

Labels

bug (Something isn't working), callback, checkpointing (Related to checkpointing), priority: 0 (High priority task), ready (PRs ready to be merged)

Projects

None yet

7 participants