Description
🐛 Bug
Inside the training loop, we incorrectly skip running evaluation when reload_dataloaders_every_epoch=True and num_sanity_val_steps=0. With these settings, we defer setting the validation dataloader on the trainer until the evaluation loop is run from inside the training loop. However, this is too late as the training loop depends on the validation dataloader settings being set in order to even determine whether we run the evaluation loop at all.
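The ordering problem described above can be sketched with a toy model. This is a minimal illustration, not Lightning's code: `ToyTrainer`, `defer_val_loading`, `on_train_epoch_end`, and `count_eval_runs` are hypothetical names, and the gate `sum(num_val_batches) == 0` is an assumed simplification of the real skip check.

```python
from typing import List


class ToyTrainer:
    """Hypothetical stand-in for the trainer; not Lightning's real class."""

    def __init__(self, defer_val_loading: bool, val_batches: int = 3) -> None:
        self._val_batches = val_batches
        self.num_val_batches: List[int] = []  # empty until the val dataloader is set
        self.eval_runs = 0
        if not defer_val_loading:
            # Eager path: the count is known before training starts.
            self.reset_val_dataloader()

    def reset_val_dataloader(self) -> None:
        self.num_val_batches = [self._val_batches]

    def run_evaluation(self) -> None:
        # Deferred path: this is the first place the count would be set.
        self.reset_val_dataloader()
        self.eval_runs += 1

    def on_train_epoch_end(self) -> None:
        # The gate reads num_val_batches BEFORE run_evaluation can set it.
        should_skip_eval = sum(self.num_val_batches) == 0
        if not should_skip_eval:
            self.run_evaluation()


def count_eval_runs(defer_val_loading: bool, epochs: int = 2) -> int:
    trainer = ToyTrainer(defer_val_loading)
    for _ in range(epochs):
        trainer.on_train_epoch_end()
    return trainer.eval_runs
```

In this toy model, `count_eval_runs(defer_val_loading=True)` never reaches `run_evaluation`, because the only code that could populate `num_val_batches` sits behind the gate that reads it.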
This means it's possible to have these states set inside of the training loop when determining whether to run the evaluation loop:
is_last_batch=True
should_check_val=True
num_val_batches=[]
should_skip_eval=True
disable_validation=False
should_train_only=True
should_skip_eval is True whenever self.trainer.num_val_batches has not been set yet. In this instance, trainer.num_val_batches=[].
https://github.com/PyTorchLightning/pytorch-lightning/blob/44d775fccfb825561937f6fa03fe258af25c2b83/pytorch_lightning/trainer/training_loop.py#L551
This points out that should_check_val and should_train_only were not consistent with each other :(
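The inconsistency between the flags can be reproduced in isolation. The sketch below assumes that should_skip_eval is derived from the sum of num_val_batches and that should_train_only is the OR of disable_validation and should_skip_eval; both are simplifications inferred from the states listed above, and `epoch_end_flags` is a hypothetical helper, not a Lightning function.

```python
from typing import Dict, List


def epoch_end_flags(
    num_val_batches: List[int],
    disable_validation: bool,
    should_check_val: bool,
) -> Dict[str, bool]:
    """Hypothetical helper mirroring the flags listed in this report."""
    # Assumption: an unset (empty) list sums to 0, so "not loaded yet"
    # is indistinguishable from "no validation batches".
    should_skip_eval = sum(num_val_batches) == 0
    should_train_only = disable_validation or should_skip_eval
    return {
        "should_check_val": should_check_val,
        "should_skip_eval": should_skip_eval,
        "should_train_only": should_train_only,
        # Both "check validation now" and "train only" claim to be true.
        "inconsistent": should_check_val and should_train_only,
    }


# State from this report: last batch, validation due, dataloader not set yet.
reported = epoch_end_flags([], disable_validation=False, should_check_val=True)

# Same call once the validation dataloader is known to have batches.
loaded = epoch_end_flags([2], disable_validation=False, should_check_val=True)
```

Here `reported["inconsistent"]` is True while `loaded["inconsistent"]` is False, which matches the observation that the two flags only disagree while num_val_batches is still empty.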
#6075 changed the order in which we call run_evaluation inside the training loop. Before that change, the problem was covered up by luck because of the ordering. Since the swap, this path has been broken.
Please reproduce using the BoringModel
https://colab.research.google.com/drive/1z9ln3gYBK-VGidNPdUE2UgE0ISAgjLpu?usp=sharing
To Reproduce
Use following BoringModel and post here
Expected behavior
Checkpointing should still work as expected, because the evaluation loop should run whenever it is due.
Environment
Note: Bugs with code are solved faster! Colab Notebook should be made public!
- IDE: Please use our python bug_report_model.py template.
- Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux):
- How you installed PyTorch (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information: