
Conversation

@ferrine commented Nov 20, 2020

What does this PR do?

Check monitor for checkpoints every epoch even if there is no validation

Fixes #4603

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified; bug fixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃

@Borda added the checkpointing (Related to checkpointing), feature (Is an improvement or enhancement), and design (Includes a design discussion) labels Nov 20, 2020
@Borda (Collaborator) commented Nov 20, 2020

@ferrine it seems it breaks a few tests, but first we shall agree on this API change...
cc: @PyTorchLightning/core-contributors

@rohitgr7 (Contributor)

Doesn't this line of code call the checkpoint if there is no validation?
https://github.com/PyTorchLightning/pytorch-lightning/blob/94a9d3d2837eb962cb47ad2854569039a552f729/pytorch_lightning/trainer/training_loop.py#L614-L615

If not, can you create a reproducible notebook using bug_report_model?

@awaelchli (Contributor) left a comment


This change is incorrect. The tests are failing for a good reason :)
As @rohitgr7 said, let's find out what the actual problem is in your use case.

@awaelchli awaelchli marked this pull request as draft November 21, 2020 13:42
@ferrine (Author) commented Nov 21, 2020

@awaelchli, @rohitgr7 thanks for the review. I think I have an idea where the error happens. I'll revert the commit and add my failing use case as the next step.

@codecov bot commented Nov 22, 2020

Codecov Report

Merging #4793 (0540544) into master (7f352cb) will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #4793   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         135     135           
  Lines       10007   10012    +5     
======================================
+ Hits         9341    9346    +5     
  Misses        666     666           

@ferrine (Author) commented Nov 22, 2020

@rohitgr7 I've created a notebook reproducing the error

TL;DR: the error reproduces if both of the following conditions hold (a minimal sketch follows the list):

  • You override validation_step
  • You do not provide val_dataloader
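A minimal sketch of such a reproduction, assuming the BoringModel pattern from the bug report template (module, metric names, and Trainer arguments here are illustrative, not the exact notebook contents):

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)  # the only place the monitored metric is logged
        return loss

    # condition 1: validation_step IS overridden ...
    def validation_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# condition 2: ... but no val_dataloader is provided, so validation never runs
# and the checkpoint callback is never triggered at the end of the epoch
model = BoringModel()
checkpoint = ModelCheckpoint(monitor="train_loss")
trainer = pl.Trainer(max_epochs=2, callbacks=[checkpoint])
trainer.fit(model, DataLoader(RandomDataset(32, 64), batch_size=8))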

@rohitgr7 (Contributor)

@rohitgr7 I've created a notebook reproducing the error

TL;DR Error reproduces if the following conditions hold

* You override validation_step

* You do not provide val_dataloader

@ferrine
That's why it doesn't hit this condition.
#4793 (comment)

I think the main issue is here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/8601268c70649f49767001098adbf665a93843df/pytorch_lightning/trainer/training_loop.py#L565-L569

Generally, the checkpoint callback is supposed to be called after each validation run; this can happen multiple times within an epoch or at the epoch end. The problem here is that the call that is supposed to happen at the end of the epoch, together with validation, only happens here, so #4793 (comment) was added to do a checkpoint when there is no validation at the epoch end.

Another way we could handle this is by imposing the condition that if validation_step is overridden then a val_dataloader must be provided, raising an exception otherwise. But this is not convenient, so there is just a warning for that.

A better way to handle this, I'd suggest, is to let the checkpoints required within an epoch happen here, while the checkpoint that is supposed to happen at the end of the epoch happens outside this loop, with no conditions related to overridden hooks at all. For this we need to refactor the method below into two different methods, one for the step-related logic and another for the epoch-related logic. This will solve a few underlying issues I am aware of right now.
https://github.com/PyTorchLightning/pytorch-lightning/blob/8601268c70649f49767001098adbf665a93843df/pytorch_lightning/trainer/training_loop.py#L842-L851

Not sure if it will break anything.
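A rough sketch of the step/epoch split described above, with purely hypothetical method and attribute names (not the actual training_loop.py API):

# Hypothetical illustration only: checkpoint triggering is split into a
# step-level hook and an epoch-level hook, so the epoch-end call no longer
# depends on whether validation_step is overridden.
class TrainingLoopSketch:
    def __init__(self, trainer):
        self.trainer = trainer

    def run_step_checkpointing(self, should_check_val: bool) -> None:
        # step-level: only trigger here when a mid-epoch validation just ran
        if should_check_val:
            self._run_checkpoint_callbacks()

    def run_epoch_checkpointing(self) -> None:
        # epoch-level: always give the checkpoint callbacks a chance,
        # regardless of overridden hooks or missing val dataloaders
        self._run_checkpoint_callbacks()

    def _run_checkpoint_callbacks(self) -> None:
        for cb in self.trainer.checkpoint_callbacks:
            cb.on_validation_end(self.trainer, self.trainer.get_model())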

@awaelchli (Contributor) commented Nov 23, 2020

Could one do the following: run the checkpoint logic on every step (train, val, test, regardless). Internally, ModelCheckpoint keeps track of the last global step and/or last epoch at which a checkpoint was saved, to avoid checkpointing multiple times on the same step. Then this logic would be independent of the training loop and live entirely inside ModelCheckpoint. Whether or not a checkpoint gets saved would be determined by a) is the monitor key available in the logged metrics, b) have we already saved on the current global step, c) ... maybe more conditions?
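A minimal sketch of that idea, assuming hypothetical attribute names and using generic callback hooks rather than the real ModelCheckpoint internals:

from pytorch_lightning.callbacks import Callback


class DedupCheckpoint(Callback):
    # saves at most once per global step, independent of which loop triggered it

    def __init__(self, monitor: str, dirpath: str = "checkpoints"):
        self.monitor = monitor
        self.dirpath = dirpath
        self._last_global_step_saved = -1  # internal bookkeeping, as proposed above

    def _maybe_save(self, trainer) -> None:
        # a) is the monitor key available in the logged metrics?
        if self.monitor not in trainer.logged_metrics:
            return
        # b) have we already saved on this global step?
        if trainer.global_step == self._last_global_step_saved:
            return
        self._last_global_step_saved = trainer.global_step
        trainer.save_checkpoint(f"{self.dirpath}/step={trainer.global_step}.ckpt")

    # the same logic runs at every loop boundary, so it works with or without validation
    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        self._maybe_save(trainer)

    def on_validation_end(self, trainer, pl_module):
        self._maybe_save(trainer)

    def on_train_epoch_end(self, trainer, pl_module, *args, **kwargs):
        self._maybe_save(trainer)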

@rohitgr7 (Contributor) commented Nov 23, 2020

a) is monitor key available in logged metrics

@awaelchli
If we implement this logic and someone doesn't log the metric specified in the monitor, it should give a proper warning or an exception. How will you distinguish that condition between these two different cases?
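For illustration only (not a decision made in this thread), one way the two cases could be told apart is to warn while the metric may simply not have been logged yet, and raise once training has clearly progressed without it ever appearing; the helper below is hypothetical:

from pytorch_lightning.utilities import rank_zero_warn
from pytorch_lightning.utilities.exceptions import MisconfigurationException


def check_monitor(monitor: str, logged_metrics: dict, global_step: int) -> bool:
    # returns True when it is safe to checkpoint on the monitored metric
    if monitor in logged_metrics:
        return True
    if global_step == 0:
        # too early to tell: the metric may simply not have been logged yet
        rank_zero_warn(f"Monitor '{monitor}' not found in logged metrics yet; skipping checkpoint.")
        return False
    # training is well under way and the key never appeared -> likely a misconfiguration
    raise MisconfigurationException(f"Monitor '{monitor}' was never logged; cannot checkpoint on it.")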

@tchaton (Contributor) commented Nov 30, 2020

Any update on this PR?

@Borda (Collaborator) commented Nov 30, 2020

@ferrine it seems that the PR is empty... is this intended, or are there some other problems with GH?

@tchaton added the waiting on author (Waiting on user action, correction, or update) label Dec 3, 2020
@ferrine (Author) commented Dec 6, 2020

There are no problems with the PR itself; I usually plan to code on weekends. The traditional lack of time because of a 5/2 day job :))

@ferrine (Author) commented Dec 20, 2020

Hey, this seems to be ready for review. I also noticed some inconsistency in rank_zero usage: the callback is supposed to be used with rank zero only, but a lot of methods are evaluated on other ranks. Any suggestions on what I should do with them?

@awaelchli (Contributor)

but a lot of methods are evaluated on other ranks. Any suggestions on what I should do with them?

Yes, this is correct and needs to stay this way.
The state of the object needs to be kept in sync, even if rank > 0 does not save anything to disk.
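A small sketch of that pattern with illustrative names: the bookkeeping runs on every rank so the callback's state stays identical across processes, while only rank 0 writes to disk:

from pytorch_lightning.utilities import rank_zero_only


class SyncedCheckpointState:
    # illustrative only, not the actual ModelCheckpoint class

    def __init__(self):
        self.best_k_models = {}  # kept in sync on ALL ranks

    def update_and_save(self, trainer, filepath: str, score: float) -> None:
        # runs on every rank, so best_k_models is identical everywhere
        self.best_k_models[filepath] = score
        self._save(trainer, filepath)

    @rank_zero_only
    def _save(self, trainer, filepath: str) -> None:
        # only rank 0 ever touches the filesystem
        trainer.save_checkpoint(filepath)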

if self.save_top_k is None and self.monitor is not None:
    self.save_top_k = 1

def _valid_monitor_key(self, trainer):
Contributor

Maybe merge together this function and _is_valid_monitor_key?


# when user ALSO asked for the 'last.ckpt' change the name
if self.save_last:
    rank_zero_info("Saving latest checkpoint...")
Contributor

This message will appear every time the checkpoint logic is run.

We only want it to appear the last time (as it did before).

Author

I've added a condition to call this info only in on_training_end.

global_step = trainer.global_step
should_save = not (
    # negative conditions
    self.save_top_k == 0 or
Contributor

About formatting: black is not strictly enforced, but it is generally used in this project. The issue here is that you put your logical operators at the end of the line, whereas black puts them at the beginning. This means that this function will look quite different once somebody comes around and runs the automatic formatting.

Can you do it yourself? Run black -S -l 120 model_checkpoint.py and then fix the comment positions and other undesired changes in formatting.

Author

Yeah, black was used before, but the CI formatter had a different opinion. I'll figure it out.

Contributor

The CI formatter probably complained about this: https://www.flake8rules.com/rules/W503.html

But the rule is outdated, as mentioned in the link.
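For illustration, the two line-break styles under discussion (the variables below are placeholders, not the PR's exact code):

save_top_k = 0        # placeholder values, only to make the snippet self-contained
already_saved = False

# operator at the END of the line (the style used in the diff above)
should_save = not (
    save_top_k == 0 or
    already_saved
)

# operator at the BEGINNING of the line (what `black -S -l 120` produces;
# this is the pattern W503 nominally flags, though the rule is outdated)
should_save = not (
    save_top_k == 0
    or already_saved
)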

@stale bot commented Jan 4, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale bot added the won't fix (This will not be worked on) label Jan 4, 2021
@ferrine (Author) commented Jan 4, 2021

I'll be back from winter holidays in a week 🙂

@stale bot removed the won't fix (This will not be worked on) label Jan 4, 2021
@ferrine changed the title from "Check monitor for checkpoints every epoch even if there is no validation" to "[[WIP] Check monitor for checkpoints every epoch even if there is no validation" Jan 4, 2021
@ferrine changed the title from "[[WIP] Check monitor for checkpoints every epoch even if there is no validation" to "[WIP] Check monitor for checkpoints every epoch even if there is no validation" Jan 4, 2021
@tchaton (Contributor) commented Jan 11, 2021

Hey @ferrine,

Any updates?

Best,
T.C

@ferrine (Author) commented Jan 11, 2021 via email

@ferrine (Author) commented Jan 16, 2021

I've merged master into this branch and I observe a bunch of Horovod-related errors in some checks; is that OK?

@carmocca (Contributor)

I've merged master into this branch and observe a bunch of horovod related errors in some checks, is that ok?

Yes, we have some issues with Horovod, mostly using torch==1.3. They should be fixed in the feature branch.

This makes me think: is this PR more of a bugfix or a feature? It seems to me like it is a feature, in which case it should point to the release/1.2-dev branch. cc: @Borda

)

# when no val loop is present or fast-dev-run still need to call checkpoints
self.check_checkpoint_callback(not (should_check_val or is_overridden('validation_step', model)))
Contributor

You can just fix your use case by adding sum(self.trainer.num_val_batches) == 0 here. I'm working on a PR, #5208, fixing more issues there.
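One possible reading of that suggestion, reconstructed around the call quoted above (the exact placement and final form live in #5208, not here):

# also trigger the epoch-end checkpoint when validation_step is overridden
# but there are no validation batches to run
self.check_checkpoint_callback(
    not (should_check_val or is_overridden('validation_step', model))
    or sum(self.trainer.num_val_batches) == 0
)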

Author

@rohitgr7, does it make sense to close the issue in #5208, not here? I'm fine with closing this PR if a nicer solution is proposed.

Contributor

I'd suggest yes, since I'm doing a bit of a refactor there to fix more issues. Your use case is already fixed there. Mind checking if it works for you?

@rohitgr7 (Contributor)

@carmocca it's a bug, since this is a case we missed for checkpointing with no validation.

@ferrine (Author) commented Jan 18, 2021 via email

@stale bot commented Feb 10, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale bot added the won't fix (This will not be worked on) label Feb 10, 2021
@rohitgr7 closed this Feb 10, 2021

Labels

checkpointing (Related to checkpointing), design (Includes a design discussion), feature (Is an improvement or enhancement), has conflicts, waiting on author (Waiting on user action, correction, or update), won't fix (This will not be worked on)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ModelCheckpoint misbehaves when no validation

7 participants