Conversation

@SkafteNicki SkafteNicki commented Feb 19, 2021

What does this PR do?

Currently, epoch-level learning rate schedulers are updated after each validation epoch, when they should be updated after each training epoch. This can be seen by adding val_check_interval=0.5 to many of our scheduler tests: they fail because the learning rate gets updated twice per epoch. This PR fixes that.

Edit: the actual bug seems to be that epoch-level learning rate schedulers are never called when val_check_interval != 1, so the tests fail because the learning rate is left unaltered.
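
The following is a minimal sketch of how the reported behaviour can be exercised, not the PR's actual test; ToyModel, the toy data, and the Trainer arguments (targeting the 1.2-era API) are illustrative assumptions. An epoch-level StepLR should step exactly once per training epoch even with val_check_interval=0.5, so after two epochs the learning rate should be 0.1 * 0.5^2 = 0.025; with the bug it stays at 0.1.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Illustrative model, not taken from the PR."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        opt = torch.optim.SGD(self.parameters(), lr=0.1)
        # Returned without an explicit dict, so Lightning treats it as an
        # epoch-level scheduler (interval="epoch").
        sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.5)
        return [opt], [sched]


ds = TensorDataset(torch.randn(8, 4), torch.randn(8, 1))
train_loader = DataLoader(ds, batch_size=2)
val_loader = DataLoader(ds, batch_size=2)

# val_check_interval=0.5 runs validation twice per training epoch; the
# scheduler should nevertheless step exactly once per training epoch.
trainer = pl.Trainer(max_epochs=2, val_check_interval=0.5, logger=False)
trainer.fit(ToyModel(), train_loader, val_loader)

# With the fix, the LR has been halved once per epoch: 0.1 -> 0.05 -> 0.025.
final_lr = trainer.optimizers[0].param_groups[0]["lr"]
assert abs(final_lr - 0.025) < 1e-8, final_lr
```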

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@SkafteNicki SkafteNicki added the bug Something isn't working label Feb 19, 2021
@SkafteNicki SkafteNicki added this to the 1.2.x milestone Feb 19, 2021

codecov bot commented Feb 19, 2021

Codecov Report

Merging #6075 (cede66a) into master (1d9c553) will decrease coverage by 60%.
The diff coverage is 11%.

@@           Coverage Diff            @@
##           master   #6075     +/-   ##
========================================
- Coverage      93%     33%    -60%     
========================================
  Files         160     160             
  Lines       11428   11319    -109     
========================================
- Hits        10677    3733   -6944     
- Misses        751    7586   +6835     

@Borda Borda added the priority: 1 Medium priority task label Feb 19, 2021

@Borda Borda left a comment

Mind adding a changelog entry?

@SkafteNicki SkafteNicki requested a review from rohitgr7 February 19, 2021 09:10
@SkafteNicki SkafteNicki changed the title from "[Bugfix] Update learning rate schedulers on train epoch and not val epoch" to "[WIP,Bugfix] Update learning rate schedulers on train epoch and not val epoch" Feb 19, 2021

pep8speaks commented Feb 19, 2021

Hello @SkafteNicki! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-24 10:20:28 UTC

@SkafteNicki SkafteNicki changed the title from "[WIP,Bugfix] Update learning rate schedulers on train epoch and not val epoch" to "[Bugfix] Update learning rate schedulers on train epoch and not val epoch" Feb 19, 2021

@tchaton tchaton left a comment

LGTM!

@tchaton tchaton enabled auto-merge (squash) February 19, 2021 12:29
@rohitgr7 rohitgr7 disabled auto-merge February 19, 2021 12:38
@Borda Borda added the ready PRs ready to be merged label Feb 19, 2021
@rohitgr7 rohitgr7 removed the ready PRs ready to be merged label Feb 19, 2021
@Borda Borda enabled auto-merge (squash) February 22, 2021 14:17
@Borda Borda added the ready PRs ready to be merged label Feb 22, 2021
@mergify mergify bot removed the has conflicts label Feb 22, 2021
@rohitgr7 rohitgr7 changed the title from "[Bugfix] Update learning rate schedulers on train epoch and not val epoch" to "[WIP] [Bugfix] Update learning rate schedulers on train epoch and not val epoch" Feb 22, 2021
@rohitgr7 rohitgr7 changed the title from "[WIP] [Bugfix] Update learning rate schedulers on train epoch and not val epoch" to "[Bugfix] Update learning rate schedulers on train epoch and not val epoch" Feb 22, 2021
@rohitgr7 rohitgr7 requested review from Borda and tchaton February 23, 2021 13:04

@carmocca carmocca left a comment

@rohitgr7 rohitgr7 changed the title from "[Bugfix] Update learning rate schedulers on train epoch and not val epoch" to "[Bugfix] Fixed epoch level schedulers not being called when val_check_interval != 1" Feb 23, 2021
@rohitgr7 rohitgr7 changed the title from "[Bugfix] Fixed epoch level schedulers not being called when val_check_interval != 1" to "[Bugfix] Fixed epoch level schedulers not being called when val_check_interval < 1.0" Feb 23, 2021
@mergify mergify bot removed the has conflicts label Feb 24, 2021
@rohitgr7 rohitgr7 merged commit 1b498d1 into Lightning-AI:master Feb 24, 2021
ananthsub pushed a commit to ananthsub/pytorch-lightning that referenced this pull request Feb 24, 2021
…_interval < 1.0 (Lightning-AI#6075)

* fix bug

* fix tests

* changelog

* fix pep8

* fix tests

* fix and add some tests

* add test for rlop

* chlog

* Update CHANGELOG.md

Co-authored-by: rohitgr7 <[email protected]>
kaushikb11 pushed a commit to kaushikb11/pytorch-lightning that referenced this pull request Mar 2, 2021
…_interval < 1.0 (Lightning-AI#6075)

* fix bug

* fix tests

* changelog

* fix pep8

* fix tests

* fix and add some tests

* add test for rlop

* chlog

* Update CHANGELOG.md

Co-authored-by: rohitgr7 <[email protected]>
@kaushikb11 kaushikb11 mentioned this pull request Mar 2, 2021
@SkafteNicki SkafteNicki deleted the lr_scheduler_bugfix branch March 2, 2021 15:55
kaushikb11 pushed a commit to kaushikb11/pytorch-lightning that referenced this pull request Mar 2, 2021
…_interval < 1.0 (Lightning-AI#6075)

* fix bug

* fix tests

* changelog

* fix pep8

* fix tests

* fix and add some tests

* add test for rlop

* chlog

* Update CHANGELOG.md

Co-authored-by: rohitgr7 <[email protected]>
lexierule pushed a commit that referenced this pull request Mar 5, 2021
…_interval < 1.0 (#6075)

* fix bug

* fix tests

* changelog

* fix pep8

* fix tests

* fix and add some tests

* add test for rlop

* chlog

* Update CHANGELOG.md

Co-authored-by: rohitgr7 <[email protected]>
Comment on lines +570 to +575
if should_train_only:
self.check_checkpoint_callback(True)
self.check_early_stopping_callback(True)

if should_check_val:
self.trainer.run_evaluation(on_epoch=True)

@ananthsub ananthsub Mar 22, 2021

Are these checks guaranteed to be mutually exclusive? I'm updating to 1.2.4 and see a test failure with one of my modules where it looks like there's a gap:
Before: run_evaluation ran first, allowing the module to run the validation loop. Inside validation_step/validation_epoch_end, the module could log metrics. Afterwards the checkpoint was force-run, so if the checkpoint was configured with a metric that was available only during validation, this still worked.
By moving run_evaluation to happen after the forced checkpoint saving, we now fail when looking up the monitor value.

I think the correct fix is to drop the training loop's force-running of the checkpoint/early-stopping callbacks here. That logic should be part of the callbacks themselves, but that's a longer-term change.
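
For illustration only (a hypothetical configuration, not taken from this thread): the gap described above would be hit by a checkpoint callback that monitors a metric produced only by the validation loop.

```python
# Hypothetical sketch: "val_loss" is logged only in validation_step (e.g. by a
# module like the ToyModel sketched earlier), so force-running the checkpoint
# callback from the training loop before run_evaluation() means the monitored
# value cannot be found at save time.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(monitor="val_loss", mode="min")
trainer = pl.Trainer(max_epochs=1, callbacks=[checkpoint])
# trainer.fit(model, train_loader, val_loader)  # model logs "val_loss" in validation only
```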

Contributor

I believe the forced checkpoint is only run when validation is disabled or the number of val batches is 0; otherwise run_evaluation is called, the callbacks are invoked inside it, and the monitor is taken care of as expected. Can you paste a small example of your case where it doesn't work?

Contributor

It's not, see #7207.

This PR introduced the change where we call the evaluation loop after the checkpoint/early-stopping callbacks are called from training.

As a result, this check for should_train_only is incomplete: it inherently depends on the evaluation loop to populate num_val_batches correctly. In run_evaluation we set the validation dataloader, but that is too late, since the validation dataloader is what is used to determine should_skip_eval above.
