
Conversation

@ananthsub ananthsub commented Apr 9, 2021

What does this PR do?

Debugging an issue referenced here: #5208

```
pytorch_lightning/trainer/training_loop.py:753: Input params: batch_idx=1, is_last_batch=True, on_epoch=True
pytorch_lightning/trainer/training_loop.py:755: batch_idx+1=2, trainer.val_check_batch=2, is_val_check_batch=True
pytorch_lightning/trainer/training_loop.py:758: current_epoch+1=1, trainer.check_val_every_n_epoch=1, is_val_check_epoch=True
pytorch_lightning/trainer/training_loop.py:761: enable_validation=True, is_val_check_epoch=True, can_check_val=True
pytorch_lightning/trainer/training_loop.py:765: is_last_batch=True, trainer.val_check_batch=2, is_last_batch_for_infinite_dataset=False
pytorch_lightning/trainer/training_loop.py:768: batch_idx + 1=2, trainer.num_training_batches=2, epoch_end_val_check=True
pytorch_lightning/trainer/training_loop.py:774: is_val_check_batch=True, is_val_check_epoch=True, can_check_val=True, is_last_batch_for_infinite_dataset=False, epoch_end_val_check=True, should_check_val=True
pytorch_lightning/trainer/training_loop.py:775: should_check_val=True, can_check_val=True
pytorch_lightning/trainer/training_loop.py:487: should_check_val=True
pytorch_lightning/trainer/training_loop.py:489: should_skip_eval=True, trainer.num_val_batches=[]
```
This check for should_skip_eval forces should_train_only to be True, which causes the checkpoint callback to run before validation. The checkpoint is configured for a metric that appears only in validation, which leads to a failure. I don't get why should_skip_eval affects should_train_only at all - shouldn't that be decided entirely by self.trainer.disable_validation? A sketch of the logic is below.
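
For reference, here's a minimal sketch of the decision logic as I read it - a paraphrase of the checks in pytorch_lightning/trainer/training_loop.py around the lines logged above, not the actual source:

```python
def should_train_only(trainer) -> bool:
    """Paraphrased sketch of the check; not the real Lightning code."""
    # trainer.num_val_batches == [] sums to 0, so an empty list forces
    # should_skip_eval to True even when should_check_val is already True.
    should_skip_eval = sum(trainer.num_val_batches) == 0
    return trainer.disable_validation or should_skip_eval
```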

This could also point to a bug in how self.trainer.num_val_batches is set.

This trainer.num_val_batches check forces the loop to think both that it should run validation and that it's train-only; the checks here are not mutually exclusive, which is what allows this. The end error is that we force the checkpoint save to run under if should_train_only, but the checkpoint is configured with a monitor for a metric that's logged only during validation. A repro sketch follows.
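
A hedged repro sketch of the failing setup (the metric name val_loss is an assumption for illustration, not taken from this PR):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Monitor a metric that the LightningModule logs only from validation hooks.
ckpt = ModelCheckpoint(monitor="val_loss")

# If trainer.num_val_batches comes back as [] during fitting, the loop takes
# the should_train_only branch and asks the checkpoint to save on a monitor
# that was never logged, producing the failure described above.
trainer = Trainer(max_epochs=1, callbacks=[ckpt])
```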

And ideally we'd drop this special-case logic entirely - this should be configurable on the callbacks!
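
One hypothetical shape for that - to be clear, save_on_train_epoch_end is an invented knob here, not an argument ModelCheckpoint currently accepts:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Hypothetical illustration only: the callback, not the loop, would decide
# whether it can save without validation ever running. The keyword below
# does not exist on ModelCheckpoint; it's sketched to show the idea.
ckpt = ModelCheckpoint(monitor="train_loss", save_on_train_epoch_end=True)
```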

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@ananthsub ananthsub closed this Apr 13, 2021
@ananthsub ananthsub deleted the swap-eval-order branch April 13, 2021 04:34