[blocked by #6997]Consolidate Training End Model Checkpoint #6671
Conversation
…oint_consolidate Update test_all_gather_grad.py
This reverts commit 9d4a2b8.
This reverts commit 0d23d75.
This reverts commit 70fe5da.
This reverts commit a9aae99.
This reverts commit ea74906.
This reverts commit bf70e43.
This reverts commit f172101.
This reverts commit 536c132.
This reverts commit 3a9fde9.
This reverts commit 7a369f4.
This reverts commit 8222dc9.
This reverts commit 6c095b2.
This reverts commit 250d0aa.
This reverts commit 8651d54.
This reverts commit dcdcd29.
@ananthsub, do you think the following could be better?
And for this trigger, we only allow …
def on_train_end(self, trainer, *args, **kwargs) -> None:
    """
    Checkpoints can be saved at the end of training.
    """
    if not self._trigger_on_train_end:
        return
    # as we advance one step at end of training, we use global_step - 1
    # to avoid saving duplicates
    trainer.global_step -= 1
    if (not self._should_skip_saving_checkpoint(trainer) and trainer.checkpoint_connector.has_trained):
        if self.save_last and self.verbose:
            rank_zero_info("Saving last checkpoint...")
        self.save_checkpoint(trainer, is_on_train_end=True)
    trainer.global_step += 1
@shuyingsunshine21 could this directly call self._save_last_checkpoint? I think the most common case will be saving a last.ckpt file at the end of training. This way we don't thread the is_on_train_end flag through everywhere.
@carmocca what do you think? Would this go along with #6470?
We could directly call self._save_last_checkpoint by ignoring the top-k setup.
One thing to discuss: if trigger_on_train_end is set, should we guarantee saving last.ckpt even if save_last is not set?
I think we should respect what's set on the callback. The other reason is that if we have multiple checkpoint callbacks, we don't need them all to save on train end; we'll configure only one of them to have save_last=True.
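For illustration, a minimal configuration sketch of that setup (a hypothetical example, not code from this PR): two ModelCheckpoint callbacks, where only the one with save_last=True would be responsible for the train-end checkpoint.

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Tracks the top-k checkpoints by validation loss; not involved in train-end saving.
best_ckpt = ModelCheckpoint(monitor="val_loss", save_top_k=3)

# Only this callback writes last.ckpt. With the option proposed in this PR
# (trigger_on_train_end, hypothetical, not an existing argument), it would also
# be the only one saving at the end of training.
last_ckpt = ModelCheckpoint(save_last=True)

trainer = Trainer(callbacks=[best_ckpt, last_ckpt])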
def on_train_end(self, trainer, *args, **kwargs) -> None:
    """
    Checkpoints can be saved at the end of training.
    """
    if not self._trigger_on_train_end:
        return
    # as we advance one step at end of training, we use global_step - 1
    # to avoid saving duplicates
    trainer.global_step -= 1
    if (not self._should_skip_saving_checkpoint(trainer) and trainer.checkpoint_connector.has_trained):
        if self.save_last and self.verbose:
            rank_zero_info("Saving last checkpoint...")
        monitor_candidates = self._monitor_candidates(trainer)
        self._save_last_checkpoint(trainer, monitor_candidates)
    trainer.global_step += 1
Suggested change (replacing the on_train_end implementation above):

def on_train_end(self, trainer, pl_module) -> None:
    """Save a checkpoint at the very end of training.

    This will only save a checkpoint if `save_last` is also enabled,
    as the monitor metrics produced by training or validation steps or at the end of epochs
    are not guaranteed to be available at this stage.
    """
    if self._should_skip_saving_checkpoint(trainer) or not trainer.checkpoint_connector.has_trained:
        return
    initial_save_last = self.save_last
    if self._save_on_train_end and not self.save_last:
        rank_zero_warn(
            "Requested to save a checkpoint at the end of training but save_last is not set. Temporarily setting save_last=True to save."
        )
        self.save_last = True
    if self.verbose:
        rank_zero_info("Saving last checkpoint...")
    # as we advance one step at end of training, we use global_step - 1
    # to avoid saving duplicates
    trainer.global_step -= 1
    monitor_candidates = self._monitor_candidates(trainer)
    self._save_last_checkpoint(trainer, monitor_candidates)
    trainer.global_step += 1
    self.save_last = initial_save_last
What do you think of this?
Also, what should happen if save_last is not set to True? Should save-on-train-end take precedence and temporarily override it? Or should we move the save_last check out of _save_last_checkpoint, so the property has to be checked first before we call _save_last_checkpoint?
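To make the second option concrete, a rough sketch of the "check first, then call" variant (an illustration of the question above, not the actual implementation):

def on_train_end(self, trainer, pl_module) -> None:
    if self._should_skip_saving_checkpoint(trainer) or not trainer.checkpoint_connector.has_trained:
        return
    # the save_last check now lives in the caller rather than inside _save_last_checkpoint
    if not self.save_last:
        return
    monitor_candidates = self._monitor_candidates(trainer)
    self._save_last_checkpoint(trainer, monitor_candidates)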
I think the original thought was that save_on_train_end depends on save_last, so it is only enabled when save_last is also set. What you propose is to always enable it regardless of save_last. Making save_on_train_end an independent trigger makes sense too.
@carmocca @awaelchli what do you think?
I think I prefer the current implementation, maybe throwing a warning so people know they should set both.
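A rough sketch of that preference (keep the current behaviour and only warn when the two settings are inconsistent; trigger_on_train_end is the option proposed in this PR, not existing API, and rank_zero_warn is the same utility used in the suggestion above):

def on_train_end(self, trainer, *args, **kwargs) -> None:
    if not self._trigger_on_train_end:
        return
    if not self.save_last:
        # respect the callback's settings, but tell the user why nothing was saved
        rank_zero_warn("trigger_on_train_end=True has no effect unless save_last=True is also set.")
        return
    ...  # proceed with the train-end save as in the implementation above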
Do we need the …
I opened #7724 to have the full picture of what would be necessary to remove … So let's keep this open for now. You can hold off on updating it, and I can hijack it later and do it myself if necessary. Thanks!
What does this PR do?
Note: as #6997 will fix global_step and the current epoch at training end, it will be useful for this PR; we will rebase after that is checked in.
Master Issue: #6672
This is to consolidate the part for model checkpointing at the end of training.
Currently, we checkpoint based on the on_validation_end hook. This can be missed when validation is skipped, e.g. via the limit_train_batches flag (#6332), or in the scenario where training fails and the validation loop is not called while a validation metric is set as monitor (see related issue "Errors within try/except of train(self) are misrepresented as checkpointing MisconfigurationException" #5766).
(Note: consolidation for the end of each training epoch needs some dependency cleanup and will be in a separate PR.)
What this PR does
- Adds checkpointing via the model_checkpoint hook on_train_end.
- As with every_n_val_epochs, we provide an option trigger_on_train_end to determine whether to checkpoint at the end of training. By default, it is turned off.
- When trigger_on_train_end is turned on, to address the issue of a missing monitor value, we relax the condition checking for the existence of the monitor key at the end of training. In that case, we fall back to saving last.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃