[WIP] Check monitor for checkpoints every epoch even if there is no validation #4793
Conversation
@ferrine it seems it breaks a few tests, but first we should agree on this API change...

Doesn't this line of code call the checkpoint if there is no validation? If not, can you create a reproducible notebook using bug_report_model?
awaelchli left a comment:
This change is incorrect; the tests are failing for good reason :)
As @rohitgr7 said, let's find out what the actual problem is in your use case.

@awaelchli, @rohitgr7 thanks for the review. I think I have an idea where the error happens; I'll revert the commit and add my failing use case as the next step.
This reverts commit 42b1f4e
Codecov Report

@@           Coverage Diff           @@
##           master   #4793    +/-  ##
=======================================
  Coverage      93%     93%
=======================================
  Files         135     135
  Lines       10007   10012     +5
=======================================
+ Hits         9341    9346     +5
  Misses        666     666
@ferrine I think the main issue is here: the checkpoint callback is generally supposed to be called after each validation, which can happen multiple times within an epoch or at the epoch end. The problem is that the call that is supposed to be made at the end of the epoch, with validation, happens only here, so #4793 (comment) is added to do a checkpoint when there is no validation at the epoch end. One way we could handle this is by imposing an extra condition, but I suggest a better way: allow any checkpoint required within an epoch to happen here, and let the checkpoint that is supposed to happen at the end of the epoch happen outside this loop, with no conditions on overridden hooks at all. For this we need to refactor the method below into two different methods, one for step-related checkpoints and another for epoch-related ones. This will solve a few underlying issues I am aware of right now. Not sure if it will break anything.
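For illustration, a rough sketch of that split (hypothetical class and method names, not the actual ModelCheckpoint code):

class ModelCheckpointSketch:
    """Hypothetical split of the checkpoint entry point into two methods."""

    def on_validation_end(self, trainer, pl_module):
        # step-related: may run several times within one epoch,
        # e.g. with val_check_interval < 1.0
        self._save_on_step(trainer)

    def on_train_epoch_end(self, trainer, pl_module):
        # epoch-related: runs exactly once per epoch, with or without a
        # validation loop, so no conditions on overridden hooks are needed
        self._save_on_epoch_end(trainer)

    def _save_on_step(self, trainer):
        ...  # monitor / top-k logic for mid-epoch checkpoints

    def _save_on_epoch_end(self, trainer):
        ...  # monitor / top-k logic for end-of-epoch checkpoints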
Could one do the following: run the checkpoint on every step (train, val, test, regardless). Internally, ModelCheckpoint would keep track of the last global step and/or last epoch on which we saved a checkpoint, to avoid checkpointing multiple times on the same step. Then this logic would be independent of the training loop and live entirely inside ModelCheckpoint. Whether or not a checkpoint gets saved would be determined by a) is the monitor key available in the logged metrics, b) have we already saved on the current global step, c) ... maybe more conditions?
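A minimal sketch of that idea, assuming hypothetical names (DedupCheckpoint, maybe_save) and a trainer exposing logged_metrics and global_step; this is not Lightning's actual API:

class DedupCheckpoint:
    def __init__(self, monitor):
        self.monitor = monitor
        self._last_global_step_saved = -1  # sentinel: never saved yet

    def maybe_save(self, trainer):
        # a) is the monitor key available in the logged metrics?
        if self.monitor not in trainer.logged_metrics:
            return False
        # b) have we already saved on the current global step?
        if trainer.global_step == self._last_global_step_saved:
            return False
        self._last_global_step_saved = trainer.global_step
        self._save_checkpoint(trainer)
        return True

    def _save_checkpoint(self, trainer):
        ...  # actual top-k bookkeeping and file I/O

Because the deduplication lives in the callback, the training loop could call maybe_save after every step and every evaluation without producing duplicate checkpoints.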
@awaelchli

Any update on this PR?

@ferrine it seems that the PR is empty... is that intended, or is there some other problem with GH?

There are no problems with the PR itself; I usually plan to code on weekends. The traditional lack of time because of a 5/2 day job :))
Hey, this seems to be ready for review. I also noticed some inconsistency in

Yes, this is correct and needs to stay this way.
if self.save_top_k is None and self.monitor is not None:
    self.save_top_k = 1

def _valid_monitor_key(self, trainer):
Maybe merge this function and _is_valid_monitor_key?
# when user ALSO asked for the 'last.ckpt' change the name
if self.save_last:
    rank_zero_info("Saving latest checkpoint...")
This message will appear every time the checkpoint logic is run. We only want it to appear the last time (as it did before).
I've added a condition to call this info only in on_training_end.
global_step = trainer.global_step
should_save = not (
    # negative conditions
    self.save_top_k == 0 or
About formatting: black is not strictly enforced but is generally used in this project. The issue here is that you put your logical operators at the end of the line, whereas black puts them at the beginning. This means this function will look quite different after somebody comes around and does automatic formatting.
Can you do it yourself? Run black -S -l 120 model_checkpoint.py and then fix the comment positions and other undesired formatting changes.
Yeah, black was used before, but the CI formatter had a different opinion. I'll figure it out.
The CI formatter probably complained about this: https://www.flake8rules.com/rules/W503.html
But that rule is outdated, as mentioned in the link.
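For illustration, the two styles at issue, shown with hypothetical placeholder values rather than the exact code in this PR:

save_top_k, monitor = 0, None  # hypothetical placeholder values

# black's style: the operator starts the continuation line.
# flake8's W503 flags this, although PEP 8 now recommends it.
should_save = not (
    save_top_k == 0
    or monitor is None
)

# the older style (operator at the end of the line), which W504 flags instead
should_save = not (
    save_top_k == 0 or
    monitor is None
)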
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

I'll be back from winter holidays in a week 🙂

Hey @ferrine, any updates? Best, T.C
Yes, I got back from the mountains just yesterday. I'll be back to the PR on the weekend 🙂
I've merged master into this branch and I see a bunch of Horovod-related errors in some checks; is that OK?

Yes, we have some issues with Horovod, mostly with torch==1.3. They should be fixed in the feature branch. This makes me think: is this PR more of a bugfix or a feature? It seems to me like it is a feature, in which case it should point to the
# when no val loop is present or fast-dev-run still need to call checkpoints
self.check_checkpoint_callback(not (should_check_val or is_overridden('validation_step', model)))
You can just fix your use case by adding sum(self.trainer.num_val_batches) == 0 here. I'm working on PR #5208, which fixes more issues there.
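Roughly, that suggestion amounts to something like the following sketch, based on the snippet above; the exact placement in the actual training loop may differ:

# checkpoint at epoch end when there is no val loop at all,
# or when a val loop exists but yields zero batches
no_val_hook = not (should_check_val or is_overridden('validation_step', model))
no_val_batches = sum(self.trainer.num_val_batches) == 0
self.check_checkpoint_callback(no_val_hook or no_val_batches)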
I'd suggest yes, since I'm doing a bit of a refactor there to fix more issues. Your use case is already fixed there. Mind checking if it works for you?
@carmocca it's a bug, since this is a case we missed: checkpointing with no validation.
Anything I can help with?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.
What does this PR do?
Check monitor for checkpoints every epoch even if there is no validation
Fixes #4603
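As context, a minimal sketch of that scenario (assuming a recent Lightning-style API and a user-defined LightningModule, here called BoringModel, that logs train_loss in training_step and defines no validation_step):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# monitor a training metric; there is no validation loop at all
checkpoint = ModelCheckpoint(monitor='train_loss', save_top_k=1)
trainer = pl.Trainer(callbacks=[checkpoint], max_epochs=5)

# before this change, the monitor was only checked after validation,
# so no top-k checkpoint was ever saved for a model like this
trainer.fit(BoringModel())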
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:
Did you have fun?
Make sure you had fun coding 🙃