Separate epoch validation from step validation #5208
Conversation
Hello @rohitgr7! Thanks for updating this PR.
Comment last updated at 2021-02-08 08:00:12 UTC
Codecov Report

```
@@           Coverage Diff           @@
##           master   #5208    +/-  ##
=======================================
- Coverage      93%     89%     -4%
=======================================
  Files         134     134
  Lines       10053   10051      -2
=======================================
- Hits         9399    8980    -419
- Misses        654    1071    +417
```
Force-pushed from 215b954 to 0cc0254
carmocca left a comment:
Minor comments. Overall awesome!
Co-authored-by: Carlos Mocholí <[email protected]>
tchaton left a comment:
Great PR! Small question.
```python
# reset stage to train
self.trainer.logger_connector.set_stage("train")

should_skip_eval = self.trainer.evaluation_loop.should_skip_evaluation(self.trainer.num_val_batches)
```
Slightly confused about this part. Can you explain why we check val here and then decide whether we should skip it?
It's just to check whether any validation dataset is available. If there isn't, we should run the train-only check; otherwise we shouldn't. There are two cases with no validation: one where there is no validation_step, and another where there is a validation_step but no validation batches. Previously, having a validation_step but no validation batches would still skip the train-only check, but ideally it should not.
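A minimal sketch of what that check amounts to, assuming num_val_batches is the per-dataloader batch-count list from the diff above (the actual should_skip_evaluation implementation in Lightning may differ):

```python
from typing import List

def should_skip_evaluation(num_val_batches: List[int]) -> bool:
    """Return True when there is nothing to validate."""
    # Covers both cases from the comment above: no validation
    # dataloaders at all (empty list), or dataloaders that all
    # contribute zero batches (e.g. limit_val_batches=0).
    return sum(num_val_batches) == 0
```

When this returns True, the train-only path (checkpointing on training metrics, epoch-level scheduler steps) should run instead of the validation path.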
Resolves issue #4603.
@rohitgr7 I'm debugging an issue right now related to this.
```
pytorch_lightning/trainer/training_loop.py:753: Input params: batch_idx=1, is_last_batch=True, on_epoch=True
pytorch_lightning/trainer/training_loop.py:755: batch_idx+1=2, trainer.val_check_batch=2, is_val_check_batch=True
pytorch_lightning/trainer/training_loop.py:758: current_epoch+1=1, trainer.check_val_every_n_epoch=1, is_val_check_epoch=True
pytorch_lightning/trainer/training_loop.py:761: enable_validation=True, is_val_check_epoch=True, can_check_val=True
pytorch_lightning/trainer/training_loop.py:765: is_last_batch=True, trainer.val_check_batch=2, is_last_batch_for_infinite_dataset=False
pytorch_lightning/trainer/training_loop.py:768: batch_idx + 1=2, trainer.num_training_batches=2, epoch_end_val_check=True
pytorch_lightning/trainer/training_loop.py:774: is_val_check_batch=True, is_val_check_epoch=True, can_check_val=True, is_last_batch_for_infinite_dataset=False, epoch_end_val_check=True, should_check_val=True
pytorch_lightning/trainer/training_loop.py:775: should_check_val=True, can_check_val=True
pytorch_lightning/trainer/training_loop.py:487: should_check_val=True
pytorch_lightning/trainer/training_loop.py:489: should_skip_eval=True, trainer.num_val_batches=[]
```
This check on should_skip_eval is forcing should_train_only to be True, which causes the checkpoint callback to run before validation. The checkpoint is configured to monitor a metric that appears only in validation, which leads to a failure. I don't understand why should_skip_eval affects should_train_only; shouldn't that be decided entirely by self.trainer.disable_validation?
This could also point to a bug in how self.trainer.num_val_batches is set.
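For context, a rough reconstruction of the decision being questioned, using the names from the log above (a sketch, not the exact Lightning source):

```python
def check_should_train_only(trainer) -> bool:
    # should_skip_eval is True in the log above because
    # trainer.num_val_batches == [].
    should_skip_eval = sum(trainer.num_val_batches) == 0

    # The reported behavior: the skip check feeds into the train-only
    # decision instead of trainer.disable_validation deciding alone,
    # so checkpointing runs on training metrics before validation and
    # a validation-only monitor is never found.
    return trainer.disable_validation or should_skip_eval
```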
```python
call.on_epoch_end(trainer, model),
call.on_train_epoch_end(trainer, model, ANY),
```
I think @williamFalcon had a point some time ago that training should run until validation, and the example was validation happening multiple times over a long training run...
cc: @tchaton @PyTorchLightning/core-contributors
Yes, it still works like that, but only if val_check_interval < 1.0, or if it is an int such that val_check_interval % num_training_batches != 0. If it is set to 1.0, validation happens after the training epoch, because we create checkpoints in on_validation_end and epoch-level learning rates are updated once training is done. In the case of ReduceLROnPlateau we need the monitored metrics, and if the monitor is training-specific they are only available after the complete training epoch is done.
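A simplified sketch of that split, using the quantities from the debug log earlier in the thread (a reconstruction under those assumptions, not the exact Lightning source):

```python
def should_check_val(trainer, batch_idx: int, current_epoch: int,
                     on_epoch: bool) -> bool:
    # Mid-epoch trigger: val_check_interval < 1.0 (or an int) maps to
    # trainer.val_check_batch, so validation can fire several times
    # within one epoch.
    is_val_check_batch = (batch_idx + 1) % trainer.val_check_batch == 0

    # Epoch-level gating via check_val_every_n_epoch.
    is_val_check_epoch = (
        (current_epoch + 1) % trainer.check_val_every_n_epoch == 0
    )
    can_check_val = trainer.enable_validation and is_val_check_epoch

    # With val_check_interval=1.0 the check collapses onto the last
    # training batch, so on_validation_end checkpoints and
    # ReduceLROnPlateau see the completed epoch's training metrics.
    epoch_end_val_check = (batch_idx + 1) == trainer.num_training_batches

    return can_check_val and is_val_check_batch and (
        not on_epoch or epoch_end_val_check
    )
```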
* Separate epoch validation from step validation
* update system
* test
* baked logic in callbacks
* unbake logic in callbacks
* fix the call for scheduler
* use property
* pep
* correct rebase
* gitignore
* ref
* add tests
* fix
* add early stopping test
* trigger
* chlog
* rev
* 1.3
* log
* Apply suggestions from code review (Co-authored-by: Carlos Mocholí <[email protected]>)
* Update pytorch_lightning/trainer/training_loop.py
* Update CHANGELOG.md
* Apply suggestions from code review

Co-authored-by: chaton <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <[email protected]>

(cherry picked from commit e429f97)
What does this PR do?
Fixes #4603
Fixes #4655
Fixes #4797
Fixes #5156
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃