
Conversation

@ananthsub ananthsub commented Apr 9, 2021

What does this PR do?

Debugging an issue referenced here: #5208

```
pytorch_lightning/trainer/training_loop.py:753: Input params: batch_idx=1, is_last_batch=True, on_epoch=True
pytorch_lightning/trainer/training_loop.py:755: batch_idx+1=2, trainer.val_check_batch=2, is_val_check_batch=True
pytorch_lightning/trainer/training_loop.py:758: current_epoch+1=1, trainer.check_val_every_n_epoch=1, is_val_check_epoch=True
pytorch_lightning/trainer/training_loop.py:761: enable_validation=True, is_val_check_epoch=True, can_check_val=True
pytorch_lightning/trainer/training_loop.py:765: is_last_batch=True, trainer.val_check_batch=2, is_last_batch_for_infinite_dataset=False
pytorch_lightning/trainer/training_loop.py:768: batch_idx + 1=2, trainer.num_training_batches=2, epoch_end_val_check=True
pytorch_lightning/trainer/training_loop.py:774: is_val_check_batch=True, is_val_check_epoch=True, can_check_val=True, is_last_batch_for_infinite_dataset=False, epoch_end_val_check=True, should_check_val=True
pytorch_lightning/trainer/training_loop.py:775: should_check_val=True, can_check_val=True
pytorch_lightning/trainer/training_loop.py:487: should_check_val=True
pytorch_lightning/trainer/training_loop.py:489: should_skip_eval=True, trainer.num_val_batches=[]
```
This check for should_skip_eval forces should_train_only to be True, which causes the checkpoint callback to run before validation. The checkpoint is configured for a metric that appears only in validation, which leads to a failure. I don't get why should_skip_eval affects should_train_only at all - shouldn't that be decided entirely by self.trainer.disable_validation? A sketch of the logic is below.
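
For reference, here's a minimal sketch of the decision logic as I read it - a paraphrase of the checks in pytorch_lightning/trainer/training_loop.py around the lines logged above, not the actual source:

```python
def should_train_only(trainer) -> bool:
    """Paraphrased sketch of the check; not the real Lightning code."""
    # trainer.num_val_batches == [] sums to 0, so an empty list forces
    # should_skip_eval to True even when should_check_val is already True.
    should_skip_eval = sum(trainer.num_val_batches) == 0
    return trainer.disable_validation or should_skip_eval
```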

This could also point to a bug in how self.trainer.num_val_batches is set.

This trainer.num_val_batches check forces the loop to think both that it should run validation and that it's train-only; the checks here are not mutually exclusive, which is what allows this. The end error is that we force the checkpoint save to run under if should_train_only, but the checkpoint is configured with a monitor for a metric that's logged only during validation. A repro sketch follows.
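
A hedged repro sketch of the failing setup (the metric name val_loss is an assumption for illustration, not taken from this PR):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Monitor a metric that the LightningModule logs only from validation hooks.
ckpt = ModelCheckpoint(monitor="val_loss")

# If trainer.num_val_batches comes back as [] during fitting, the loop takes
# the should_train_only branch and asks the checkpoint to save on a monitor
# that was never logged, producing the failure described above.
trainer = Trainer(max_epochs=1, callbacks=[ckpt])
```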

And ideally we'd drop this special-case logic entirely - this should be configurable on the callbacks!
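
One hypothetical shape for that - to be clear, save_on_train_epoch_end is an invented knob here, not an argument ModelCheckpoint currently accepts:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Hypothetical illustration only: the callback, not the loop, would decide
# whether it can save without validation ever running. The keyword below
# does not exist on ModelCheckpoint; it's sketched to show the idea.
ckpt = ModelCheckpoint(monitor="train_loss", save_on_train_epoch_end=True)
```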

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@ananthsub ananthsub closed this Apr 13, 2021
@ananthsub ananthsub deleted the swap-eval-order branch April 13, 2021 04:34