
Conversation

@shuyingsunshine21 (Contributor) commented Mar 25, 2021

What does this PR do?

Note: #6997 will fix global_step and the current epoch at the end of training, which is useful for this PR; will rebase after that is checked in.
Master Issue: #6672

This PR consolidates the logic for model checkpointing at the end of training.

Currently, we checkpoint via the on_validation_end hook, which is

  1. confusing
  2. potentially buggy when the checkpoint monitors a validation metric but limit_train_batches prevents the validation loop from being called (see related issue: Validation not called when using an IterableDataset and limit_train_batches flag #6332), or when training fails before the validation loop runs while a validation metric is set as the monitor (see related issue: Errors within try/except of train(self) are misrepresented as checkpointing MisconfigurationException #5766). A rough configuration sketch of the first scenario follows this list.
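A rough sketch of the first scenario in item 2, with assumed values: the callback monitors a validation metric, but the validation loop may never run, so checkpointing from on_validation_end cannot save what it was configured for. The metric name and limits below are illustrative, not the exact reproduction from #6332.

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# The callback depends on a metric that only the validation loop produces.
ckpt = ModelCheckpoint(monitor="val_loss")

# With an IterableDataset and limit_train_batches (per #6332), the validation loop
# may never be called, so "val_loss" never exists and the on_validation_end-based
# checkpointing has nothing to save.
trainer = Trainer(max_epochs=1, limit_train_batches=2, callbacks=[ckpt])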

(Note: consolidating the checkpoint at the end of each training epoch requires some dependency cleanup and will be done in a separate PR.)

What this PR does

  • Move the end-of-training checkpoint logic from the training loop to the ModelCheckpoint hook on_train_end.
  • Instead of relying on every_n_val_epochs, provide an option trigger_on_train_end to determine whether to checkpoint at the end of training. It is turned off by default (see the usage sketch after this list).
  • When trigger_on_train_end is turned on, to address the case where the monitor value is missing, relax the condition that checks for the existence of the monitor key at the end of training; in that case, fall back to saving the last checkpoint.
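A minimal usage sketch of the proposed behavior. trigger_on_train_end is the option introduced by this PR, not a released ModelCheckpoint argument, and the monitor and max_epochs values are assumptions for illustration.

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Proposed behavior: save a final checkpoint in on_train_end, falling back to a
# "last" checkpoint if the monitored metric never became available.
ckpt = ModelCheckpoint(
    monitor="val_loss",
    save_last=True,
    trigger_on_train_end=True,  # proposed flag from this PR; off by default
)
trainer = Trainer(max_epochs=3, callbacks=[ckpt])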

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Shuying Sun and others added 30 commits March 23, 2021 12:06
…oint_consolidate

Update test_all_gather_grad.py
…1-checkpoint_consolidate"

This reverts commit c5053da, reversing
changes made to 0d23d75.
This reverts commit 70fe5da.
This reverts commit a9aae99.
@shuyingsunshine21 shuyingsunshine21 marked this pull request as ready for review April 10, 2021 07:47
@mergify mergify bot removed the has conflicts label Apr 11, 2021
@shuyingsunshine21 (Contributor, Author) commented:
@ananthsub, do you think the following could be better?

trigger_on_train_end should be mutually exclusive with the rest of the trigger modes (every_n_train_steps, every_n_val_epochs, ...),

and for this trigger, we only allow save_last.
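A hypothetical sketch of such a mutual-exclusion check. The helper name and plain arguments are assumptions for illustration; inside ModelCheckpoint this would operate on the corresponding attributes, and it is not merged Lightning code.

from pytorch_lightning.utilities.exceptions import MisconfigurationException


def validate_trigger_configuration(
    trigger_on_train_end: bool, every_n_train_steps: int, every_n_val_epochs: int
) -> None:
    # Hypothetical check: the end-of-training trigger is rejected whenever one of
    # the step- or epoch-based triggers is also configured.
    if trigger_on_train_end and (every_n_train_steps > 0 or every_n_val_epochs > 0):
        raise MisconfigurationException(
            "`trigger_on_train_end=True` cannot be combined with "
            "`every_n_train_steps` or `every_n_val_epochs`."
        )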

Comment on lines 246 to 259
def on_train_end(self, trainer, *args, **kwargs) -> None:
    """
    checkpoints can be saved at the end of the training
    """
    if not self._trigger_on_train_end:
        return
    # as we advance one step at end of training, we use global_step - 1
    # to avoid saving duplicates
    trainer.global_step -= 1
    if (not self._should_skip_saving_checkpoint(trainer) and trainer.checkpoint_connector.has_trained):
        if self.save_last and self.verbose:
            rank_zero_info("Saving last checkpoint...")
        self.save_checkpoint(trainer, is_on_train_end=True)
    trainer.global_step += 1
@ananthsub (Contributor) commented Apr 13, 2021:

@shuyingsunshine21 could this directly call self._save_last_checkpoint? I think the most common case will be saving a last.ckpt file at the end of training. That way we don't have to thread the is_on_train_end flag through everywhere.

@carmocca what do you think? would this go along with #6470 ?

if (not self._should_skip_saving_checkpoint(trainer) and trainer.checkpoint_connector.has_trained):
    if self.save_last and self.verbose:
        rank_zero_info("Saving last checkpoint...")
    self.save_checkpoint(trainer, is_on_train_end=True)
@shuyingsunshine21 (Contributor, Author) replied:

@ananthsub,

can this not directly call self._save_last_checkpoint? I think the most common case will be saving a last.ckpt file at the end of training. That way we don't have to thread the is_on_train_end flag through everywhere.

We could directly call self._save_last_checkpoint by ignoring the top-k setup.

One thing to discuss: if trigger_on_train_end is set, should we guarantee saving last.ckpt even if save_last is not set?

Contributor reply:

I think we should respect what's set on the callback. The other reason is that if we have multiple checkpoint callbacks, we don't need them all to save on train end; we'll configure only one of them to have save_last=True.
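A rough sketch of the multi-callback setup described above. The metric name and save_top_k value are illustrative, and trigger_on_train_end is the flag proposed in this PR rather than an existing Lightning argument.

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Only the second callback is configured to write a final "last" checkpoint at
# the end of training; the first one keeps tracking the best validation scores.
best_ckpt = ModelCheckpoint(monitor="val_loss", save_top_k=3)
last_ckpt = ModelCheckpoint(save_last=True, trigger_on_train_end=True)  # proposed flag
trainer = Trainer(callbacks=[best_ckpt, last_ckpt])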

@shuyingsunshine21 shuyingsunshine21 changed the title Consolidate Training End Model Checkpoint [blocked by #6997]Consolidate Training End Model Checkpoint Apr 20, 2021
Comment on lines +246 to +261
def on_train_end(self, trainer, *args, **kwargs) -> None:
    """
    checkpoints can be saved at the end of the training
    """
    if not self._trigger_on_train_end:
        return
    # as we advance one step at end of training, we use global_step - 1
    # to avoid saving duplicates
    trainer.global_step -= 1
    if (not self._should_skip_saving_checkpoint(trainer) and trainer.checkpoint_connector.has_trained):
        if self.save_last and self.verbose:
            rank_zero_info("Saving last checkpoint...")
        monitor_candidates = self._monitor_candidates(trainer)
        self._save_last_checkpoint(trainer, monitor_candidates)
    trainer.global_step += 1

Contributor reply:

Suggested change (replacing the snippet above with):

def on_train_end(self, trainer, pl_module) -> None:
    """Save a checkpoint at the very end of training.

    This will only save a checkpoint if `save_last` is also enabled,
    as the monitor metrics produced by training or validation steps or epoch ends
    are not guaranteed to be available at this stage.
    """
    if self._should_skip_saving_checkpoint(trainer) or not trainer.checkpoint_connector.has_trained:
        return
    initial_save_last = self.save_last
    if self._save_on_train_end and not self.save_last:
        rank_zero_warn(
            "Requested to save a checkpoint at the end of training but save_last is not set."
            " Temporarily setting save_last=True to save."
        )
        self.save_last = True
    if self.verbose:
        rank_zero_info("Saving last checkpoint...")
    # as we advance one step at end of training, we use global_step - 1
    # to avoid saving duplicates
    trainer.global_step -= 1
    monitor_candidates = self._monitor_candidates(trainer)
    self._save_last_checkpoint(trainer, monitor_candidates)
    trainer.global_step += 1
    self.save_last = initial_save_last

what do you think of this?

Also, what should happen if save_last is not set to True? Should save-on-train-end take precedence and temporarily override it? Should we move the save_last check out of _save_last_checkpoint, so that the property has to be checked before we call _save_last_checkpoint?

@shuyingsunshine21 (Contributor, Author) replied:

I think the original thought was that save_on_train_end depends on save_last, so it is only enabled when save_last is also set. What you propose is to always enable it regardless of save_last. Making save_on_train_end an independent trigger also makes sense.

Contributor reply:

@carmocca @awaelchli what do you think?

Contributor reply:

I think I prefer the current implementation, maybe throwing a warning so people know they should set both.
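A hypothetical sketch of the warning being suggested. The helper name, arguments, and wording are assumptions; only the idea (warn when the end-of-training trigger is enabled without save_last) comes from the discussion above.

from pytorch_lightning.utilities import rank_zero_warn


def warn_if_save_last_missing(trigger_on_train_end: bool, save_last: bool) -> None:
    # Hypothetical helper: let users know both options should be enabled together
    # for a checkpoint to actually be written at the end of training.
    if trigger_on_train_end and not save_last:
        rank_zero_warn(
            "`trigger_on_train_end=True` will not save a checkpoint unless `save_last=True` is also set."
        )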

@awaelchli awaelchli added this to the v1.4 milestone May 3, 2021
@awaelchli awaelchli added the checkpointing (Related to checkpointing) and feature (Is an improvement or enhancement) labels May 3, 2021
@carmocca (Contributor) commented:
Do we need the trigger_on_train_end flag?

@carmocca carmocca mentioned this pull request May 26, 2021
@shuyingsunshine21 (Contributor, Author) commented:
Hi @carmocca, if my understanding is correct, your PR #7724 would include this change. Maybe I could abandon this one?

@carmocca (Contributor) replied:
I opened #7724 to have the full picture of what would be necessary to remove check_checkpoint_callback. But once that is validated, it can be split into smaller changes including the ones in this PR (after tweaking).

So let's keep this open for now. You can hold off on updating it and I can hijack it later and do it myself if necessary.

Thanks!


Labels

checkpointing (Related to checkpointing), feature (Is an improvement or enhancement), has conflicts

Projects

None yet


6 participants