
Conversation

@ferrine commented Nov 20, 2020

What does this PR do?

Check monitor for checkpoints every epoch even if there is no validation

Fixes #4603

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified; bug fixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃

@Borda added the checkpointing (Related to checkpointing), feature (Is an improvement or enhancement), and design (Includes a design discussion) labels Nov 20, 2020
@Borda (Collaborator) commented Nov 20, 2020

@ferrine it seems it breaks a few tests, but first we shall agree on this API change...
cc: @PyTorchLightning/core-contributors

@rohitgr7 (Contributor)

Doesn't this line of code call the checkpoint if there is no validation?
https://github.com/PyTorchLightning/pytorch-lightning/blob/94a9d3d2837eb962cb47ad2854569039a552f729/pytorch_lightning/trainer/training_loop.py#L614-L615

If not, can you create a reproducible notebook using bug_report_model?

@awaelchli (Contributor) left a comment


This change is incorrect. The tests are failing for a good reason :)
As @rohitgr7 said, let's find out what the actual problem is in your use case.

@awaelchli awaelchli marked this pull request as draft November 21, 2020 13:42
@ferrine (Author) commented Nov 21, 2020

@awaelchli, @rohitgr7 thanks for the review. I think I have an idea where the error happens. I'll revert the commit and add my failing use case as the next step.

@codecov bot commented Nov 22, 2020

Codecov Report

Merging #4793 (0540544) into master (7f352cb) will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #4793   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         135     135           
  Lines       10007   10012    +5     
======================================
+ Hits         9341    9346    +5     
  Misses        666     666           

@ferrine (Author) commented Nov 22, 2020

@rohitgr7 I've created a notebook reproducing the error

TL;DR: the error reproduces if both of the following conditions hold (a minimal sketch follows the list):

  • You override validation_step
  • You do not provide val_dataloader
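A minimal sketch of such a reproduction, assuming the BoringModel pattern from the bug report template (module, metric names, and Trainer arguments here are illustrative, not the exact notebook contents):

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)  # the only place the monitored metric is logged
        return loss

    # condition 1: validation_step IS overridden ...
    def validation_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# condition 2: ... but no val_dataloader is provided, so validation never runs
# and the checkpoint callback is never triggered at the end of the epoch
model = BoringModel()
checkpoint = ModelCheckpoint(monitor="train_loss")
trainer = pl.Trainer(max_epochs=2, callbacks=[checkpoint])
trainer.fit(model, DataLoader(RandomDataset(32, 64), batch_size=8))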

@rohitgr7 (Contributor)

@rohitgr7 I've created a notebook reproducing the error

TL;DR Error reproduces if the following conditions hold

* You override validation_step

* You do not provide val_dataloader

@ferrine
That's why it doesn't hit this condition.
#4793 (comment)

I think the main issue is here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/8601268c70649f49767001098adbf665a93843df/pytorch_lightning/trainer/training_loop.py#L565-L569

Generally, the checkpoint callback is supposed to be called after each validation run; this can happen multiple times within an epoch or at the epoch end. The problem here is that the call that is supposed to happen at the end of the epoch, together with validation, only happens here, so #4793 (comment) was added to do a checkpoint when there is no validation at the epoch end.

Another way we could handle this is by imposing the condition that if validation_step is overridden then a val_dataloader must be provided, raising an exception otherwise. But this is not convenient, so there is just a warning for that.

A better way to handle this, I'd suggest, is to let the checkpoints required within an epoch happen here, while the checkpoint that is supposed to happen at the end of the epoch happens outside this loop, with no conditions related to overridden hooks at all. For this we need to refactor the method below into two different methods, one for the step-related logic and another for the epoch-related logic. This will solve a few underlying issues I am aware of right now.
https://github.com/PyTorchLightning/pytorch-lightning/blob/8601268c70649f49767001098adbf665a93843df/pytorch_lightning/trainer/training_loop.py#L842-L851

Not sure if it will break anything.
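A rough sketch of the step/epoch split described above, with purely hypothetical method and attribute names (not the actual training_loop.py API):

# Hypothetical illustration only: checkpoint triggering is split into a
# step-level hook and an epoch-level hook, so the epoch-end call no longer
# depends on whether validation_step is overridden.
class TrainingLoopSketch:
    def __init__(self, trainer):
        self.trainer = trainer

    def run_step_checkpointing(self, should_check_val: bool) -> None:
        # step-level: only trigger here when a mid-epoch validation just ran
        if should_check_val:
            self._run_checkpoint_callbacks()

    def run_epoch_checkpointing(self) -> None:
        # epoch-level: always give the checkpoint callbacks a chance,
        # regardless of overridden hooks or missing val dataloaders
        self._run_checkpoint_callbacks()

    def _run_checkpoint_callbacks(self) -> None:
        for cb in self.trainer.checkpoint_callbacks:
            cb.on_validation_end(self.trainer, self.trainer.get_model())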

@awaelchli (Contributor) commented Nov 23, 2020

Could one do the following: run the checkpoint logic on every step (train, val, test, regardless). Internally, ModelCheckpoint keeps track of the last global step and/or last epoch at which a checkpoint was saved, to avoid checkpointing multiple times on the same step. Then this logic would be independent of the training loop and live entirely inside ModelCheckpoint. Whether or not a checkpoint gets saved would be determined by a) is the monitor key available in the logged metrics, b) have we already saved on the current global step, c) ... maybe more conditions?
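A minimal sketch of that idea, assuming hypothetical attribute names and using generic callback hooks rather than the real ModelCheckpoint internals:

from pytorch_lightning.callbacks import Callback


class DedupCheckpoint(Callback):
    # saves at most once per global step, independent of which loop triggered it

    def __init__(self, monitor: str, dirpath: str = "checkpoints"):
        self.monitor = monitor
        self.dirpath = dirpath
        self._last_global_step_saved = -1  # internal bookkeeping, as proposed above

    def _maybe_save(self, trainer) -> None:
        # a) is the monitor key available in the logged metrics?
        if self.monitor not in trainer.logged_metrics:
            return
        # b) have we already saved on this global step?
        if trainer.global_step == self._last_global_step_saved:
            return
        self._last_global_step_saved = trainer.global_step
        trainer.save_checkpoint(f"{self.dirpath}/step={trainer.global_step}.ckpt")

    # the same logic runs at every loop boundary, so it works with or without validation
    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        self._maybe_save(trainer)

    def on_validation_end(self, trainer, pl_module):
        self._maybe_save(trainer)

    def on_train_epoch_end(self, trainer, pl_module, *args, **kwargs):
        self._maybe_save(trainer)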

@rohitgr7 (Contributor) commented Nov 23, 2020

a) is monitor key available in logged metrics

@awaelchli
If we implement this logic and someone doesn't log the metric specified in the monitor, it should give a proper warning or an exception. How will you distinguish that condition between these two different cases?
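For illustration only (not a decision made in this thread), one way the two cases could be told apart is to warn while the metric may simply not have been logged yet, and raise once training has clearly progressed without it ever appearing; the helper below is hypothetical:

from pytorch_lightning.utilities import rank_zero_warn
from pytorch_lightning.utilities.exceptions import MisconfigurationException


def check_monitor(monitor: str, logged_metrics: dict, global_step: int) -> bool:
    # returns True when it is safe to checkpoint on the monitored metric
    if monitor in logged_metrics:
        return True
    if global_step == 0:
        # too early to tell: the metric may simply not have been logged yet
        rank_zero_warn(f"Monitor '{monitor}' not found in logged metrics yet; skipping checkpoint.")
        return False
    # training is well under way and the key never appeared -> likely a misconfiguration
    raise MisconfigurationException(f"Monitor '{monitor}' was never logged; cannot checkpoint on it.")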

@tchaton (Contributor) commented Nov 30, 2020

Any update on this PR?

@Borda (Collaborator) commented Nov 30, 2020

@ferrine it seems that the PR is empty... is this intended, or are there some other problems with GH?

@tchaton added the waiting on author (Waiting on user action, correction, or update) label Dec 3, 2020
@ferrine (Author) commented Dec 6, 2020

There are no problems with the PR itself; I usually plan to code on weekends. The traditional lack of time because of a 5/2 day job :))

@ferrine (Author) commented Dec 20, 2020

Hey, this seems to be ready for review. I also noticed some inconsistency in rank_zero usage: the callback is supposed to be used with rank zero only, but a lot of methods are evaluated on other ranks. Any suggestions on what I should do with them?

@awaelchli (Contributor)

but a lot of methods are evaluated on other ranks. Any suggestions on what I should do with them?

Yes, this is correct and needs to stay this way.
The state of the object needs to be kept in sync, even if rank > 0 does not save anything to disk.
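A small sketch of that pattern with illustrative names: the bookkeeping runs on every rank so the callback's state stays identical across processes, while only rank 0 writes to disk:

from pytorch_lightning.utilities import rank_zero_only


class SyncedCheckpointState:
    # illustrative only, not the actual ModelCheckpoint class

    def __init__(self):
        self.best_k_models = {}  # kept in sync on ALL ranks

    def update_and_save(self, trainer, filepath: str, score: float) -> None:
        # runs on every rank, so best_k_models is identical everywhere
        self.best_k_models[filepath] = score
        self._save(trainer, filepath)

    @rank_zero_only
    def _save(self, trainer, filepath: str) -> None:
        # only rank 0 ever touches the filesystem
        trainer.save_checkpoint(filepath)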

if self.save_top_k is None and self.monitor is not None:
    self.save_top_k = 1

def _valid_monitor_key(self, trainer):
Contributor

Maybe merge together this function and _is_valid_monitor_key?


# when user ALSO asked for the 'last.ckpt' change the name
if self.save_last:
    rank_zero_info("Saving latest checkpoint...")
Contributor

This message will appear every time the checkpoint logic is run.

We only want it to appear the last time (as it did before).

Author

I've added a condition to call this info only in on_training_end.

global_step = trainer.global_step
should_save = not (
    # negative conditions
    self.save_top_k == 0 or
Contributor

About formatting: black is not strictly enforced, but it is generally used in this project. The issue here is that you put your logical operators at the end of the line, whereas black puts them at the beginning. This means that this function will look quite different once somebody comes around and runs the automatic formatting.

Can you do it yourself? Run black -S -l 120 model_checkpoint.py and then fix the comment positions and other undesired changes in formatting.

Author

Yeah, black was used before, but the CI formatter had a different opinion. I'll figure it out.

Contributor

The CI formatter probably complained about this: https://www.flake8rules.com/rules/W503.html

But the rule is outdated, as mentioned in the link.
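For illustration, the two line-break styles under discussion (the variables below are placeholders, not the PR's exact code):

save_top_k = 0        # placeholder values, only to make the snippet self-contained
already_saved = False

# operator at the END of the line (the style used in the diff above)
should_save = not (
    save_top_k == 0 or
    already_saved
)

# operator at the BEGINNING of the line (what `black -S -l 120` produces;
# this is the pattern W503 nominally flags, though the rule is outdated)
should_save = not (
    save_top_k == 0
    or already_saved
)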

@stale bot commented Jan 4, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale bot added the won't fix (This will not be worked on) label Jan 4, 2021
@ferrine (Author) commented Jan 4, 2021

I'll be back from winter holidays in a week 🙂

@stale bot removed the won't fix (This will not be worked on) label Jan 4, 2021
@ferrine changed the title from "Check monitor for checkpoints every epoch even if there is no validation" to "[[WIP] Check monitor for checkpoints every epoch even if there is no validation" Jan 4, 2021
@ferrine changed the title from "[[WIP] Check monitor for checkpoints every epoch even if there is no validation" to "[WIP] Check monitor for checkpoints every epoch even if there is no validation" Jan 4, 2021
@tchaton (Contributor) commented Jan 11, 2021

Hey @ferrine,

Any updates?

Best,
T.C

@ferrine (Author) commented Jan 11, 2021 via email

@ferrine (Author) commented Jan 16, 2021

I've merged master into this branch and I observe a bunch of Horovod-related errors in some checks; is that OK?

@carmocca (Contributor)

I've merged master into this branch and observe a bunch of horovod related errors in some checks, is that ok?

Yes, we have some issues with Horovod, mostly using torch==1.3. They should be fixed in the feature branch.

This makes me think: is this PR more of a bugfix or a feature? It seems to me like it is a feature, in which case it should point to the release/1.2-dev branch. cc: @Borda

)

# when no val loop is present or fast-dev-run still need to call checkpoints
self.check_checkpoint_callback(not (should_check_val or is_overridden('validation_step', model)))
Contributor

You can just fix your use case by adding sum(self.trainer.num_val_batches) == 0 here. I'm working on a PR, #5208, fixing more issues there.
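One possible reading of that suggestion, reconstructed around the call quoted above (the exact placement and final form live in #5208, not here):

# also trigger the epoch-end checkpoint when validation_step is overridden
# but there are no validation batches to run
self.check_checkpoint_callback(
    not (should_check_val or is_overridden('validation_step', model))
    or sum(self.trainer.num_val_batches) == 0
)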

Author

@rohitgr7, does it make sense to close the issue in #5208, not here? I'm fine with closing this PR if a nicer solution is proposed.

Contributor

I'd suggest yes, since I'm doing a bit of a refactor there to fix more issues. Your use case is already fixed there. Mind checking if it works for you?

@rohitgr7 (Contributor)

@carmocca it's a bug, since this is a case we missed for checkpointing with no validation.

@ferrine (Author) commented Jan 18, 2021 via email

@stale bot commented Feb 10, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale bot added the won't fix (This will not be worked on) label Feb 10, 2021
@rohitgr7 closed this Feb 10, 2021

Labels

checkpointing (Related to checkpointing), design (Includes a design discussion), feature (Is an improvement or enhancement), has conflicts, waiting on author (Waiting on user action, correction, or update), won't fix (This will not be worked on)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ModelCheckpoint misbehaves when no validation

7 participants