deprecate passing ModelCheckpoint instance to Trainer(checkpoint_callback=...) #4336
Conversation
Hello @awaelchli! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-10-30 03:12:23 UTC
```diff
     self,
     logger: Union[LightningLoggerBase, Iterable[LightningLoggerBase], bool] = True,
-    checkpoint_callback: Union[ModelCheckpoint, bool] = True,
+    checkpoint_callback: bool = True,
```
> Q: Why can we not just remove the `checkpoint_callback` argument altogether?
> A: Lightning philosophy is to provide good defaults, this includes checkpointing. We need a way to turn it off though, i.e., `checkpoint_callback=False`

What about keeping the `checkpoint_callback` parameter but changing its default value to `False`? I think it will be pretty annoying to have to set `checkpoint_callback=False` every time you pass a custom `ModelCheckpoint` via `callbacks`. And I think most people use a custom `ModelCheckpoint` rather than just `checkpoint_callback=True`.
> What about keeping the `checkpoint_callback` parameter but changing its default value to `False`?

That doesn't solve the problem I'm trying to solve here, which is to eliminate ambiguity when restoring the state of the trainer. See the answer to the 2nd FAQ question.

> pretty annoying to have to set `checkpoint_callback=False` every time you pass a custom `ModelCheckpoint`

With this PR's proposal, the value will be ignored if you pass in a custom one. `False` is only needed when you want to disable checkpointing completely. I believe I have this covered in a test.
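A minimal sketch of the behaviour described here, with stand-in names (this is not Lightning's actual implementation): the boolean flag only decides whether a default `ModelCheckpoint` is added, and any instance the user supplies via `callbacks` takes precedence.

```python
# Hypothetical sketch; `ModelCheckpoint` and `configure_checkpoint_callbacks`
# are stand-ins for illustration, not pytorch_lightning's real code.

class ModelCheckpoint:
    """Stand-in for pytorch_lightning.callbacks.ModelCheckpoint."""
    def __init__(self, monitor=None):
        self.monitor = monitor

def configure_checkpoint_callbacks(callbacks, checkpoint_callback=True):
    """Resolve the boolean flag against user-supplied callbacks."""
    if any(isinstance(c, ModelCheckpoint) for c in callbacks):
        # A custom instance was passed via `callbacks`: the flag is ignored.
        return list(callbacks)
    if checkpoint_callback:
        # "Good defaults" philosophy: add a default checkpoint callback.
        return list(callbacks) + [ModelCheckpoint()]
    # checkpoint_callback=False and no custom instance: checkpointing is off.
    return list(callbacks)
```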
I agree with @carmocca; this is super confusing when adding my own checkpoint callback. Given how loose the default checkpoint callback is, and with the coming customizations, I'd rather drop the `checkpoint_callback` arg altogether and force everything to be configured through `callbacks`. Given that the callback implementation already exists, I personally don't think it's much of an ask for people to instantiate the checkpoint callback (and confirm their settings while doing so) and pass it along to the trainer.

I also think that's a nice message for users: "see how extensible this framework is" vs. "look at all the magic this trainer configures for you which you can't change".

Even if that's not in this PR, it feels inevitable that `checkpoint_callback=False` would eventually be the new default, and then later we could drop the arg altogether.
Yes, fine with me; I don't have a strong preference here.

Note that this PR does NOT close the discussion on whether there should be a `checkpoint_callback` arg or not. I'm simply restricting what can be passed to the argument. It looks like a lot of API change, but it is really more of a bugfix.
@williamFalcon @teddykoker remove ambiguity by making `checkpoint_callback` boolean. sg?
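A rough sketch of what such a deprecation path could look like (the helper name and return shape are assumptions for illustration; the PR's actual internals may differ): the old `Union[ModelCheckpoint, bool]` value is normalised to a bool plus an extra callback, with a deprecation warning.

```python
# Hypothetical sketch, not Lightning's real code.
import warnings

class ModelCheckpoint:
    """Stand-in for pytorch_lightning.callbacks.ModelCheckpoint."""

def resolve_checkpoint_flag(checkpoint_callback):
    """Normalise the deprecated Union[ModelCheckpoint, bool] argument.

    Returns a tuple (enable_checkpointing, extra_callbacks).
    """
    if isinstance(checkpoint_callback, ModelCheckpoint):
        warnings.warn(
            "Passing a ModelCheckpoint instance to `checkpoint_callback` is "
            "deprecated; pass it via `callbacks=[...]` instead.",
            DeprecationWarning,
        )
        # Keep old behaviour working: treat the instance as a user callback.
        return True, [checkpoint_callback]
    return bool(checkpoint_callback), []
```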
```python
trainer.weights_summary = None  # not needed before full run
trainer.logger = DummyLogger()
trainer.callbacks = []  # not needed before full run
trainer.checkpoint_callback = False  # required for saving
```
I removed these from Tuner because ModelCheckpoint now entirely lives in callbacks list, and this is properly backed up by Tuner already.
cc @SkafteNicki
Codecov Report

```diff
@@           Coverage Diff           @@
##           master   #4336    +/-  ##
=======================================
+ Coverage      91%     93%     +2%
=======================================
  Files         113     111      -2
  Lines        8323    8138    -185
=======================================
- Hits         7588    7563     -25
+ Misses        735     575    -160
```
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Jeff Yang <[email protected]>
s-rog left a comment:
So the change makes `checkpoint_callback` a bool, and to add args to it (monitor, top_k, dir, etc.) a custom one is to be passed into the trainer with `callbacks=[checkpoint_callback]`?

If so, the warning in the trainer docs, "Only user defined callbacks (ie: Not ModelCheckpoint)", needs to be removed as well (trainer/__init__.py line 490).
@s-rog Yes, correct. I have already updated the docs you mention on another branch, which will follow this one. I didn't want to make the PR too big, so I'm keeping the docs updates separate.
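To make the migration concrete, here is an illustrative before/after using stand-ins for the Lightning classes (parameter names such as `monitor` and `save_top_k` are assumed to mirror the real `ModelCheckpoint`; this is not library code):

```python
# Stand-in classes for illustration only, not pytorch_lightning's real code.

class ModelCheckpoint:
    """Stand-in: checkpointing behaviour is configured explicitly here."""
    def __init__(self, monitor=None, save_top_k=1):
        self.monitor = monitor
        self.save_top_k = save_top_k

class Trainer:
    """Stand-in showing only the two arguments under discussion."""
    def __init__(self, checkpoint_callback=True, callbacks=()):
        self.checkpoint_callback = checkpoint_callback  # bool only, after this PR
        self.callbacks = list(callbacks)

# Deprecated style: Trainer(checkpoint_callback=ModelCheckpoint(...))
# New style: configure the instance yourself and pass it through `callbacks`.
ckpt = ModelCheckpoint(monitor="val_loss", save_top_k=3)
trainer = Trainer(callbacks=[ckpt])
```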
deprecate passing ModelCheckpoint instance to Trainer(checkpoint_callback=...) (#4336)

* first attempt
* update tests
* support multiple
* test bugfix
* changelog
* pep
* pep
* import order
* import
* improve test for resuming
* test
* update test
* add references test

Co-authored-by: Carlos Mocholí <[email protected]>

* docstring suggestion deprecation

Co-authored-by: Jeff Yang <[email protected]>

* paramref

Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Jeff Yang <[email protected]>
(cherry picked from commit d1234c5)
What does this PR do?

Pitch: Deprecate passing a `ModelCheckpoint` instance directly to the `checkpoint_callback` argument. Allow only bool.

FAQ

Q: Why can we not just remove the `checkpoint_callback` argument altogether?

A: Lightning philosophy is to provide good defaults, and this includes checkpointing. We need a way to turn it off though, i.e., `checkpoint_callback=False`.

Q: Why can't we keep it the way it is?

A: When we add `Trainer(checkpoint_callback=mycustomcallback)`, internally the Trainer adds it to `self.callbacks`. A problem arises when we reload the Trainer using `resume_from_checkpoint="..."`. In this case we would end up with a conflict, and a decision needs to be made: do we load the callback that was saved in the checkpoint, or do we overwrite it with the one that is passed in to the Trainer? We can try to handle this case and make the decision for the user (#4027), or we go with this PR to remove the ambiguity altogether.

Alternatives

- Remove `checkpoint_callback`. This implies no default checkpointing.
- Keep `checkpoint_callback`, and if `trainer.callbacks` is empty, add a ModelCheckpoint automatically. If any other callbacks need to be passed in, we need to supply a ModelCheckpoint manually.

TODOs for this PR:
Fixes #4014 automatically
Maybe Fixes #4386 (need to reproduce first)
Closes #3990
Closes #4027
Related to #4335, as it is a good step in the direction of allowing multiple model checkpoints.
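The resume ambiguity described in the FAQ disappears once the instance can only come from `callbacks`: on resume there is exactly one object to restore state into. A toy sketch (the `load_state` method and state dict keys are hypothetical, not Lightning's actual restore code):

```python
# Stand-in for illustration only; not pytorch_lightning's real restore logic.

class ModelCheckpoint:
    """Stand-in for the real callback; only bookkeeping fields are shown."""
    def __init__(self, monitor="val_loss"):
        self.monitor = monitor
        self.best_model_score = None

    def load_state(self, state):
        # Restore the bookkeeping saved in the checkpoint file into the
        # user's own instance; its configuration (e.g. `monitor`) stays as
        # the user set it, so there is no second, conflicting instance.
        self.best_model_score = state["best_model_score"]

ckpt = ModelCheckpoint(monitor="val_acc")
ckpt.load_state({"best_model_score": 0.12})  # state read from the checkpoint
```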