Conversation

@SkafteNicki SkafteNicki commented Feb 19, 2021

What does this PR do?

Currently, epoch-level learning rate schedulers are updated after each validation epoch, when they should be updated after each training epoch. This can be seen by adding val_check_interval=0.5 to many of our scheduler tests: they fail because the learning rate gets updated twice per epoch. This PR fixes that.

Edit: the actual bug seems to be that epoch-level learning rate schedulers are never called when val_check_interval != 1, so the tests fail because the learning rate is left unaltered.
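
The following is a minimal sketch of how the reported behaviour can be exercised, not the PR's actual test; ToyModel, the toy data, and the Trainer arguments (targeting the 1.2-era API) are illustrative assumptions. An epoch-level StepLR should step exactly once per training epoch even with val_check_interval=0.5, so after two epochs the learning rate should be 0.1 * 0.5^2 = 0.025; with the bug it stays at 0.1.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Illustrative model, not taken from the PR."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        opt = torch.optim.SGD(self.parameters(), lr=0.1)
        # Returned without an explicit dict, so Lightning treats it as an
        # epoch-level scheduler (interval="epoch").
        sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.5)
        return [opt], [sched]


ds = TensorDataset(torch.randn(8, 4), torch.randn(8, 1))
train_loader = DataLoader(ds, batch_size=2)
val_loader = DataLoader(ds, batch_size=2)

# val_check_interval=0.5 runs validation twice per training epoch; the
# scheduler should nevertheless step exactly once per training epoch.
trainer = pl.Trainer(max_epochs=2, val_check_interval=0.5, logger=False)
trainer.fit(ToyModel(), train_loader, val_loader)

# With the fix, the LR has been halved once per epoch: 0.1 -> 0.05 -> 0.025.
final_lr = trainer.optimizers[0].param_groups[0]["lr"]
assert abs(final_lr - 0.025) < 1e-8, final_lr
```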

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@SkafteNicki SkafteNicki added the bug Something isn't working label Feb 19, 2021
@SkafteNicki SkafteNicki added this to the 1.2.x milestone Feb 19, 2021

codecov bot commented Feb 19, 2021

Codecov Report

Merging #6075 (cede66a) into master (1d9c553) will decrease coverage by 60%.
The diff coverage is 11%.

@@           Coverage Diff            @@
##           master   #6075     +/-   ##
========================================
- Coverage      93%     33%    -60%     
========================================
  Files         160     160             
  Lines       11428   11319    -109     
========================================
- Hits        10677    3733   -6944     
- Misses        751    7586   +6835     

@Borda Borda added the priority: 1 Medium priority task label Feb 19, 2021

@Borda Borda left a comment

Mind adding a changelog entry?

@SkafteNicki SkafteNicki requested a review from rohitgr7 February 19, 2021 09:10
@SkafteNicki SkafteNicki changed the title from "[Bugfix] Update learning rate schedulers on train epoch and not val epoch" to "[WIP,Bugfix] Update learning rate schedulers on train epoch and not val epoch" Feb 19, 2021

pep8speaks commented Feb 19, 2021

Hello @SkafteNicki! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-24 10:20:28 UTC

@SkafteNicki SkafteNicki changed the title from "[WIP,Bugfix] Update learning rate schedulers on train epoch and not val epoch" to "[Bugfix] Update learning rate schedulers on train epoch and not val epoch" Feb 19, 2021

@tchaton tchaton left a comment

LGTM!

@tchaton tchaton enabled auto-merge (squash) February 19, 2021 12:29
@rohitgr7 rohitgr7 disabled auto-merge February 19, 2021 12:38
@Borda Borda added the ready PRs ready to be merged label Feb 19, 2021
@rohitgr7 rohitgr7 removed the ready PRs ready to be merged label Feb 19, 2021
@Borda Borda enabled auto-merge (squash) February 22, 2021 14:17
@Borda Borda added the ready PRs ready to be merged label Feb 22, 2021
@mergify mergify bot removed the has conflicts label Feb 22, 2021
@rohitgr7 rohitgr7 changed the title from "[Bugfix] Update learning rate schedulers on train epoch and not val epoch" to "[WIP] [Bugfix] Update learning rate schedulers on train epoch and not val epoch" Feb 22, 2021
@rohitgr7 rohitgr7 changed the title from "[WIP] [Bugfix] Update learning rate schedulers on train epoch and not val epoch" to "[Bugfix] Update learning rate schedulers on train epoch and not val epoch" Feb 22, 2021
@rohitgr7 rohitgr7 requested review from Borda and tchaton February 23, 2021 13:04

@carmocca carmocca left a comment

@rohitgr7 rohitgr7 changed the title from "[Bugfix] Update learning rate schedulers on train epoch and not val epoch" to "[Bugfix] Fixed epoch level schedulers not being called when val_check_interval != 1" Feb 23, 2021
@rohitgr7 rohitgr7 changed the title from "[Bugfix] Fixed epoch level schedulers not being called when val_check_interval != 1" to "[Bugfix] Fixed epoch level schedulers not being called when val_check_interval < 1.0" Feb 23, 2021
@mergify mergify bot removed the has conflicts label Feb 24, 2021
@rohitgr7 rohitgr7 merged commit 1b498d1 into Lightning-AI:master Feb 24, 2021
ananthsub pushed a commit to ananthsub/pytorch-lightning that referenced this pull request Feb 24, 2021
…_interval < 1.0 (Lightning-AI#6075)

* fix bug

* fix tests

* changelog

* fix pep8

* fix tests

* fix and add some tests

* add test for rlop

* chlog

* Update CHANGELOG.md

Co-authored-by: rohitgr7 <[email protected]>
kaushikb11 pushed a commit to kaushikb11/pytorch-lightning that referenced this pull request Mar 2, 2021
…_interval < 1.0 (Lightning-AI#6075)

* fix bug

* fix tests

* changelog

* fix pep8

* fix tests

* fix and add some tests

* add test for rlop

* chlog

* Update CHANGELOG.md

Co-authored-by: rohitgr7 <[email protected]>
@kaushikb11 kaushikb11 mentioned this pull request Mar 2, 2021
@SkafteNicki SkafteNicki deleted the lr_scheduler_bugfix branch March 2, 2021 15:55
kaushikb11 pushed a commit to kaushikb11/pytorch-lightning that referenced this pull request Mar 2, 2021
…_interval < 1.0 (Lightning-AI#6075)

* fix bug

* fix tests

* changelog

* fix pep8

* fix tests

* fix and add some tests

* add test for rlop

* chlog

* Update CHANGELOG.md

Co-authored-by: rohitgr7 <[email protected]>
lexierule pushed a commit that referenced this pull request Mar 5, 2021
…_interval < 1.0 (#6075)

* fix bug

* fix tests

* changelog

* fix pep8

* fix tests

* fix and add some tests

* add test for rlop

* chlog

* Update CHANGELOG.md

Co-authored-by: rohitgr7 <[email protected]>
Comment on lines +570 to +575
if should_train_only:
self.check_checkpoint_callback(True)
self.check_early_stopping_callback(True)

if should_check_val:
self.trainer.run_evaluation(on_epoch=True)

@ananthsub ananthsub Mar 22, 2021

Are these checks guaranteed to be mutually exclusive? I'm updating to 1.2.4 and see a test failure with one of my modules where it looks like there's a gap:
Before: run_evaluation ran first, allowing the module to run the validation loop. Inside validation_step/validation_epoch_end, the module could log metrics. Afterwards the checkpoint was force-run, so if the checkpoint was configured with a metric that was available only during validation, this still worked.
By moving run_evaluation to happen after the forced checkpoint saving, we now fail when looking up the monitor value.

I think the correct fix is to drop the training loop's force-running of the checkpoint/early-stopping callbacks here. That logic should be part of the callbacks themselves, but that's a longer-term change.
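
For illustration only (a hypothetical configuration, not taken from this thread): the gap described above would be hit by a checkpoint callback that monitors a metric produced only by the validation loop.

```python
# Hypothetical sketch: "val_loss" is logged only in validation_step (e.g. by a
# module like the ToyModel sketched earlier), so force-running the checkpoint
# callback from the training loop before run_evaluation() means the monitored
# value cannot be found at save time.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(monitor="val_loss", mode="min")
trainer = pl.Trainer(max_epochs=1, callbacks=[checkpoint])
# trainer.fit(model, train_loader, val_loader)  # model logs "val_loss" in validation only
```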

Contributor

I believe the forced checkpoint is only run when validation is disabled or the number of val batches is 0; otherwise run_evaluation is called, the callbacks are invoked inside it, and the monitor is taken care of as expected. Can you paste a small example of your case where it doesn't work?

Contributor

It's not, see #7207.

This PR introduced the change where we call the evaluation loop after the checkpoint/early-stopping callbacks are called from training.

As a result, this check for should_train_only is incomplete: it inherently depends on the evaluation loop to populate num_val_batches correctly. In run_evaluation we set the validation dataloader, but that is too late, since the validation dataloader is what is used to determine should_skip_eval above.
