Using reload_dataloaders_every_epoch=True and num_sanity_val_steps=0 can lead to the validation loop being skipped #7208

@ananthsub

🐛 Bug

Inside the training loop, we incorrectly skip running evaluation when reload_dataloaders_every_epoch=True and num_sanity_val_steps=0. With these settings, we defer setting the validation dataloader on the trainer until the evaluation loop is run from inside the training loop. However, this is too late: the training loop relies on the validation dataloader settings already being set in order to decide whether to run the evaluation loop at all.
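To make the circular dependency concrete, here is a toy sketch in plain Python. It does not use Lightning; `ToyTrainer`, `end_of_epoch_check_first`, and `end_of_epoch_reset_first` are hypothetical names, and the sketch assumes the skip decision treats an empty `num_val_batches` as "nothing to validate". The reset-first variant only illustrates the dependency and is not necessarily the fix adopted upstream.

```python
class ToyTrainer:
    """Hypothetical stand-in for the trainer; not Lightning's real class."""

    def __init__(self, batches_per_val_loader):
        # Stays empty until the validation dataloader is set up, which is
        # the situation when the sanity check is disabled and the reload
        # is deferred to the evaluation loop.
        self.num_val_batches = []
        self._batches_per_val_loader = batches_per_val_loader
        self.eval_runs = 0

    def reset_val_dataloader(self):
        # Records how many batches the validation loader provides.
        self.num_val_batches = [self._batches_per_val_loader]

    def run_evaluation(self):
        # The deferred dataloader setup only happens in here.
        self.reset_val_dataloader()
        self.eval_runs += 1


def end_of_epoch_check_first(trainer):
    # Decides whether to evaluate before the dataloader has ever been set.
    if sum(trainer.num_val_batches) == 0:
        return
    trainer.run_evaluation()


def end_of_epoch_reset_first(trainer):
    # Sets up the dataloader first, so the decision sees real batch counts.
    trainer.reset_val_dataloader()
    if sum(trainer.num_val_batches) == 0:
        return
    trainer.run_evaluation()


def count_eval_runs(end_of_epoch, epochs=3):
    trainer = ToyTrainer(batches_per_val_loader=2)
    for _ in range(epochs):
        end_of_epoch(trainer)
    return trainer.eval_runs


skipped = count_eval_runs(end_of_epoch_check_first)
ran = count_eval_runs(end_of_epoch_reset_first)
print(skipped, ran)
```

In the check-first variant the evaluation never runs, because the only place that would populate `num_val_batches` is the evaluation call that keeps getting skipped.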

This means it's possible to have these states set inside of the training loop when determining whether to run the evaluation loop:

is_last_batch=True
should_check_val=True
num_val_batches=[]
should_skip_eval=True
disable_validation=False
should_train_only=True

should_skip_eval=True whenever self.trainer.num_val_batches isn't set; in this instance, trainer.num_val_batches=[].
https://github.com/PyTorchLightning/pytorch-lightning/blob/44d775fccfb825561937f6fa03fe258af25c2b83/pytorch_lightning/trainer/training_loop.py#L551
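The combination of flags listed above can be reproduced with a small standalone model of the decision. This is a simplified sketch, not the code at the linked line: `should_skip_evaluation` and `training_loop_flags` are illustrative helpers, and the sketch assumes that evaluation is skipped when the batch counts sum to zero and that should_train_only is derived from disable_validation and should_skip_eval.

```python
def should_skip_evaluation(num_val_batches):
    # Assumption: an empty list is treated the same as zero batches.
    return sum(num_val_batches) == 0


def training_loop_flags(num_val_batches, disable_validation, should_check_val):
    # Assumption: training-only mode is entered when validation is
    # disabled or when the evaluation would be skipped.
    should_skip_eval = should_skip_evaluation(num_val_batches)
    should_train_only = disable_validation or should_skip_eval
    return {
        "should_check_val": should_check_val,
        "should_skip_eval": should_skip_eval,
        "should_train_only": should_train_only,
        "runs_evaluation": should_check_val and not should_train_only,
    }


# State from the report: validation is due, but batch counts are not set yet.
flags = training_loop_flags(
    num_val_batches=[], disable_validation=False, should_check_val=True
)
print(flags)
```

Under these assumptions, should_check_val says validation is due while should_train_only says to skip it, which is the inconsistency described in this report.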

This points out that should_check_val and should_train_only were not consistent with each other :(

#6075 changed the order in which we call run_evaluation inside the training loop. Before that change, this bug was masked by luck because of the ordering. Since the swap, it has been broken.

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1z9ln3gYBK-VGidNPdUE2UgE0ISAgjLpu?usp=sharing

Expected behavior

Checkpointing should still work as expected, because the evaluation loop should run whenever validation is due.

Labels: bug, help wanted
