
Conversation

@EliaCereda
Contributor

@EliaCereda EliaCereda commented Dec 2, 2020

What does this PR do?

Author: @carmocca. Original author: @EliaCereda.

Refactor RunningStage usage in advance of implementing Trainer.validate(...).

  • Define trainer.evaluating to check whether we are in RunningStage.VALIDATING or RunningStage.TESTING (see the sketch after this list)
  • Define RunningStage.SANITY_CHECKING
  • Define TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}
  • Deprecate trainer.running_sanity_check in favor of trainer.sanity_checking
  • Update the other components to use trainer.evaluating instead of trainer.testing
  • Disable the EarlyStopping and ModelCheckpoint callbacks when not in TrainerState.FITTING. This has no effect when evaluating on the test set, since they were already disabled there, but it will be necessary for the validation set.
  • Rename a few other attributes of Trainer to clarify that they are used by both test(…) and validate(…)
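
For illustration, here is a minimal sketch of how the new trainer.evaluating and trainer.sanity_checking properties could be expressed. It assumes the Trainer tracks its current stage in a _running_stage attribute; that attribute name is an assumption of this sketch, not necessarily what this PR merges:

    from pytorch_lightning.trainer.states import RunningStage  # module path as of 1.2

    class Trainer:
        def __init__(self) -> None:
            self._running_stage = RunningStage.SANITY_CHECKING  # illustrative default

        @property
        def evaluating(self) -> bool:
            # True while running either the validation or the test loop
            return self._running_stage in (RunningStage.VALIDATING, RunningStage.TESTING)

        @property
        def sanity_checking(self) -> bool:
            # True only during the sanity check run before fitting
            return self._running_stage == RunningStage.SANITY_CHECKING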

Fixes #5593

Feature request issue: #4634
Split from PR #4707; see it for the full discussion
Second PR: #4948, for context

Backwards compatibility

Unfortunately, a trainer.evaluating property already existed, indicating whether validation was being run. That property has been renamed to trainer.validating. I don't see any way to avoid this compatibility break, but it should be mostly fine, as we are the ones who rely on these attributes.
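
To keep old code working through the transition, the deprecated trainer.running_sanity_check can simply forward to the new property. A hedged sketch of that shim, using the standard library warnings mechanism (the merged code may use a different message or warning utility):

    import warnings

    class Trainer:
        @property
        def sanity_checking(self) -> bool:
            return getattr(self, "_sanity_checking", False)

        @property
        def running_sanity_check(self) -> bool:
            # deprecated alias: warn, then delegate to the new property
            warnings.warn(
                "`trainer.running_sanity_check` is deprecated; use `trainer.sanity_checking` instead.",
                DeprecationWarning,
            )
            return self.sanity_checking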

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR

@EliaCereda EliaCereda force-pushed the feature/trainer-validate-1 branch from a0d0c4b to edb3e83 on December 2, 2020 12:47
@codecov

codecov bot commented Dec 2, 2020

Codecov Report

Merging #4945 (c46ae62) into master (46540ee) will increase coverage by 0%.
The diff coverage is 91%.

@@          Coverage Diff           @@
##           master   #4945   +/-   ##
======================================
  Coverage      91%     92%           
======================================
  Files         161     161           
  Lines       11457   11479   +22     
======================================
+ Hits        10481   10512   +31     
+ Misses        976     967    -9     

@EliaCereda EliaCereda changed the title from "Add Trainer.validate(…) method to run one validation epoch [1/n]" to "Add Trainer.validate(…) method to run one validation epoch [1/2]" Dec 2, 2020
@EliaCereda EliaCereda marked this pull request as ready for review December 2, 2020 16:02
@rohitgr7
Contributor

rohitgr7 commented Dec 2, 2020

In trainer/evaluation_loop.py and accelerators/tpu_accelerator.py, self.testing is still used in some places. Can you check?

@EliaCereda
Contributor Author

You're absolutely correct about accelerators/tpu_accelerator.py: it is still using Trainer.testing. In addition, I found other usages in plugins/sharded_plugin.py, accelerators/ddp_spawn_accelerator.py, tests/model_train_steps.py, and possibly in connectors/logger_connector.py. I must have missed these occurrences; I'll review them tomorrow.

On the other hand, it seems to me that evaluation_loop.py only accesses EvaluationLoop.testing, which is a different attribute: its purpose is to determine whether EvaluationLoop is currently in the validation or the test loop, and Trainer already sets it correctly from this line. It will be False during the validation phase of fit() and in validate(), but True during test().

Let me know if I missed any usage of Trainer.testing directly in EvaluationLoop.
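
To make the distinction concrete, here is a self-contained toy model of the two flags (illustrative names only, not the actual library code):

    from enum import Enum

    class RunningStage(Enum):
        TRAINING = "train"
        SANITY_CHECKING = "sanity_check"
        VALIDATING = "validate"
        TESTING = "test"

    def trainer_evaluating(stage: RunningStage) -> bool:
        # mirrors the new Trainer.evaluating: True for both val and test
        return stage in (RunningStage.VALIDATING, RunningStage.TESTING)

    def loop_testing(stage: RunningStage) -> bool:
        # mirrors EvaluationLoop.testing: True only for the test loop
        return stage is RunningStage.TESTING

    # validation phase of fit(), and validate():
    assert trainer_evaluating(RunningStage.VALIDATING) and not loop_testing(RunningStage.VALIDATING)
    # test():
    assert trainer_evaluating(RunningStage.TESTING) and loop_testing(RunningStage.TESTING)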

@rohitgr7
Contributor

rohitgr7 commented Dec 2, 2020

EvaluationLoop.testing doesn't seem to do anything special other than checking whether to load and call the test methods or the eval methods. I think we can easily do this with self.trainer.evaluating.

if self.testing:
    do something
else:
    do something else

can be:

if self.trainer.evaluating == 'test':
    do something
else:
    do something else

but we need to make sure it's correct and doesn't break anything else.

@EliaCereda
Contributor Author

Yes, I also think that would work. As you say, the harder part is ensuring everything works correctly, which is not trivial when the logic is intertwined across a number of files.

This is the main reason I stopped the refactoring at the attributes on Trainer.

In addition, EvaluationLoop specifically doesn't need to distinguish between fit, validate, and test, but only whether the current evaluation is over the validation set or the test set. This is essentially what replacing testing with evaluating makes possible, but it would not be of much benefit here.

And Trainer is a user-facing class, where it was important both to provide a coherent API and to avoid breaking existing users who were potentially already using testing or tested_checkpoint_path. Since EvaluationLoop is mainly an internal object, hidden from users, maybe it makes sense to leave this as a future refactor?

@Borda Borda added this to the 1.2 milestone Dec 4, 2020
@Borda Borda added the feature (Is an improvement or enhancement) and refactor labels Dec 4, 2020
@mergify mergify bot requested a review from a team December 12, 2020 14:57
@Borda Borda changed the base branch from master to release/1.2-dev December 14, 2020 17:34
@Borda
Collaborator

Borda commented Jan 26, 2021

hi @EliaCereda, how is it going here? Just waiting for review?
cc: @tchaton @awaelchli

@rohitgr7
Contributor

@EliaCereda I have cleaned up #4945 (comment) on master. It will be available once 1.1.5 gets synced to 1.2-dev: #5583

@mergify mergify bot added the has conflicts label Mar 4, 2021
@mergify mergify bot removed the has conflicts label Mar 5, 2021
Contributor

@tchaton tchaton left a comment


Awesome PR. Small comments!

@tchaton tchaton enabled auto-merge (squash) March 5, 2021 18:48
@carmocca carmocca disabled auto-merge March 6, 2021 00:52
Contributor

@awaelchli awaelchli left a comment


nice

@carmocca carmocca added the ready (PRs ready to be merged) label Mar 6, 2021
@tchaton tchaton merged commit d0596fa into Lightning-AI:master Mar 6, 2021
@tchaton tchaton mentioned this pull request Mar 9, 2021
facebook-github-bot pushed a commit to facebookresearch/d2go that referenced this pull request Apr 14, 2021
…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:
### New commit log messages
## [UnReleased] - 2021-MM-DD

### Added

- Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667))

- Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492))

- Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417))

- Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)).

- Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470))

- Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072))

- Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948))

- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915))

- Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673))

- Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277))

- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274))

- Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370))

- Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633))

- Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139))

- Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543))

- Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621))

- Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618))

- Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120))

- Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679))

- Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595))

- Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736))

- Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677))

- Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764))

### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259))

- Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621))

- Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349))

### Deprecated

- `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621))

- Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505),
    [#6530](Lightning-AI/pytorch-lightning#6530),
    [#6540](Lightning-AI/pytorch-lightning#6540),
    [#6547](Lightning-AI/pytorch-lightning#6547),
    [#6515](Lightning-AI/pytorch-lightning#6515),
    [#6572](Lightning-AI/pytorch-lightning#6572),
    [#6573](Lightning-AI/pytorch-lightning#6573),
    [#6584](Lightning-AI/pytorch-lightning#6584),
    [#6636](Lightning-AI/pytorch-lightning#6636),
    [#6637](Lightning-AI/pytorch-lightning#6637),
    [#6649](Lightning-AI/pytorch-lightning#6649),
    [#6659](Lightning-AI/pytorch-lightning#6659),
)

### Removed

- Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164))

- Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139))

- Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166))

- Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163))

- Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161))
    * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
    * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`

- Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162))

- Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167))

- Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016))

- Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207))

- Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734))

- Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093))

### Fixed

- Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802))

- Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011))

- Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070))

- Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109))

- Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436))

- Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416))

- Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506))

- Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705))

## [1.2.7] - 2021-04-06

### Fixed

- Fixed a bug with omegaconf and `xm.save` ([#6741](Lightning-AI/pytorch-lightning#6741))
- Fixed an issue with IterableDataset when `__len__` is not defined ([#6828](Lightning-AI/pytorch-lightning#6828))
- Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836))
- Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588))
- Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816))
- Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730))

## [1.2.6] - 2021-03-30

### Changed

- Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498))

### Removed

- Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682))

### Fixed

- Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398))
- Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657))
- Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719))

## [1.2.5] - 2021-03-23

### Changed

- Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576))
- Refactored setup for typing friendly ([#6590](Lightning-AI/pytorch-lightning#6590))

### Fixed

- Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587))
- Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434))
- Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275))
- Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565))

Reviewed By: shuyingsunshine21

Differential Revision: D27528929

fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f