[Bugfix] Fixed epoch level schedulers not being called when val_check_interval < 1.0 #6075
Conversation
Codecov Report

```
@@           Coverage Diff            @@
##           master   #6075    +/-   ##
========================================
- Coverage      93%     33%     -60%
========================================
  Files         160     160
  Lines       11428   11319    -109
========================================
- Hits        10677    3733   -6944
- Misses        751    7586   +6835
```
Borda
left a comment
Mind adding a changelog entry?
Hello @SkafteNicki! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-02-24 10:20:28 UTC
tchaton
left a comment
LGTM!
carmocca
left a comment
Awesome @SkafteNicki @rohitgr7 !
…_interval < 1.0 (Lightning-AI#6075) * fix bug * fix tests * changelog * fix pep8 * fix tests * fix and add some tests * add test for rlop * chlog * Update CHANGELOG.md Co-authored-by: rohitgr7 <[email protected]>
```python
if should_train_only:
    self.check_checkpoint_callback(True)
    self.check_early_stopping_callback(True)

if should_check_val:
    self.trainer.run_evaluation(on_epoch=True)
```
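As context for the review comments, here is a simplified sketch of the intended either/or dispatch at epoch end. The flag names mirror the diff, but the function and string actions are illustrative only, not the actual Lightning internals:

```python
def epoch_end_dispatch(should_train_only: bool, should_check_val: bool) -> list:
    """Intended behavior: at most one of the two branches runs per epoch end."""
    actions = []
    if should_train_only:
        # No validation configured: force the train-only callbacks.
        actions += ["check_checkpoint_callback", "check_early_stopping_callback"]
    if should_check_val:
        # Validation configured: the evaluation loop invokes the callbacks itself.
        actions.append("run_evaluation")
    return actions
```

The review thread below questions whether these two flags are actually guaranteed to be mutually exclusive in practice.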
Are these checks guaranteed to be mutually exclusive? I'm updating to 1.2.4 and see a test failure with one of my modules where it looks like there's a gap:

Before: run_evaluation ran first, allowing the module to run the validation loop. Inside validation_step/validation_epoch_end, the module could log metrics. Afterward, the checkpoint was force-run, so if the checkpoint callback monitored a metric that was available only during validation, this still worked.

By moving run_evaluation to happen after the forced checkpoint saving, we fail when looking up the monitor value.

I think the correct fix is dropping the training loop's force-running of the checkpoint/early stopping callbacks here. They should be part of the callbacks themselves, but that's a longer-term change.
I believe we run the force checkpoint only if validation is disabled or the number of val batches is 0; otherwise run_evaluation is called, the callbacks are invoked inside it, and the monitor value is taken care of as expected. Can you paste a small example of your case where it doesn't work?
It's not, see #7207.

This PR introduced the change where we call the evaluation loop after the checkpoint/early stopping callbacks from training.

As a result, the should_train_only check is incomplete: it inherently depends on the evaluation loop to populate num_val_batches correctly. In run_evaluation we set up the validation dataloader, but this is too late, since the validation dataloader is what determines should_skip_eval above.
What does this PR do?

Currently, epoch level learning rate schedulers are updated after each validation epoch, when they should be updated after each training epoch. This can be seen by adding val_check_interval=0.5 to many of our scheduler tests: they fail because the learning rate gets updated twice per epoch. This PR fixes it.

Edit: actually, the bug seems to be that epoch level learning rate schedulers are never called when val_check_interval != 1, so the tests fail because the learning rate is unaltered.

Before submitting
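A minimal, framework-free sketch of the fixed behavior (the function and its parameters are illustrative, not Lightning API): an epoch-level scheduler must step exactly once per training epoch, no matter how often validation runs inside that epoch.

```python
def run_training(epochs: int, batches_per_epoch: int,
                 val_check_interval: float, lr: float = 0.1,
                 gamma: float = 0.5) -> float:
    """Decay an epoch-level learning rate once per *training* epoch."""
    # How many training batches between mid-epoch validation runs.
    val_check_batch = max(1, int(batches_per_epoch * val_check_interval))
    for _ in range(epochs):
        for batch_idx in range(batches_per_epoch):
            if (batch_idx + 1) % val_check_batch == 0:
                # Mid-epoch validation boundary. The buggy code tied the
                # epoch-level scheduler update to this point (or skipped
                # it entirely when val_check_interval != 1).
                pass
        lr *= gamma  # the fix: exactly one epoch-level step, at epoch end
    return lr
```

With the fix, the final learning rate is the same for val_check_interval=0.5 and val_check_interval=1.0, since validation frequency no longer influences the scheduler.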
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃