
Conversation

@EliaCereda
Contributor

@EliaCereda EliaCereda commented Dec 2, 2020

What does this PR do?

Author: @carmocca. Original author: @EliaCereda.

Refactor RunningStage usage in advance of implementing Trainer.validate(...).

  • Define trainer.evaluating to check whether we are in RunningStage.VALIDATING or RunningStage.TESTING (see the sketch after this list)
  • Define RunningStage.SANITY_CHECKING
  • Define TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}
  • Deprecate trainer.running_sanity_check in favor of trainer.sanity_checking
  • Update the other components to use trainer.evaluating instead of trainer.testing
  • Disable the EarlyStopping and ModelCheckpoint callbacks when not in TrainerState.FITTING. This has no effect when evaluating on the test set, since they were already disabled there, but it will be necessary for the validation set.
  • Rename a few other attributes of Trainer to clarify that they are used by both test(…) and validate(…)
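
For illustration, here is a minimal sketch of how the new trainer.evaluating and trainer.sanity_checking properties could be expressed. It assumes the Trainer tracks its current stage in a _running_stage attribute; that attribute name is an assumption of this sketch, not necessarily what this PR merges:

    from pytorch_lightning.trainer.states import RunningStage  # module path as of 1.2

    class Trainer:
        def __init__(self) -> None:
            self._running_stage = RunningStage.SANITY_CHECKING  # illustrative default

        @property
        def evaluating(self) -> bool:
            # True while running either the validation or the test loop
            return self._running_stage in (RunningStage.VALIDATING, RunningStage.TESTING)

        @property
        def sanity_checking(self) -> bool:
            # True only during the sanity check run before fitting
            return self._running_stage == RunningStage.SANITY_CHECKING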

Fixes #5593

Feature request issue: #4634
Split from PR #4707; see it for the full discussion
Second PR: #4948, for context

Backwards compatibility

Unfortunately, a trainer.evaluating property already existed, indicating whether validation was being run. That property has been renamed to trainer.validating. I don't see any way to avoid this compatibility break, but it should be mostly fine, as we are the ones who rely on these attributes.
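
To keep old code working through the transition, the deprecated trainer.running_sanity_check can simply forward to the new property. A hedged sketch of that shim, using the standard library warnings mechanism (the merged code may use a different message or warning utility):

    import warnings

    class Trainer:
        @property
        def sanity_checking(self) -> bool:
            return getattr(self, "_sanity_checking", False)

        @property
        def running_sanity_check(self) -> bool:
            # deprecated alias: warn, then delegate to the new property
            warnings.warn(
                "`trainer.running_sanity_check` is deprecated; use `trainer.sanity_checking` instead.",
                DeprecationWarning,
            )
            return self.sanity_checking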

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR

@EliaCereda EliaCereda force-pushed the feature/trainer-validate-1 branch from a0d0c4b to edb3e83 on December 2, 2020 12:47
@codecov

codecov bot commented Dec 2, 2020

Codecov Report

Merging #4945 (c46ae62) into master (46540ee) will increase coverage by 0%.
The diff coverage is 91%.

@@          Coverage Diff           @@
##           master   #4945   +/-   ##
======================================
  Coverage      91%     92%           
======================================
  Files         161     161           
  Lines       11457   11479   +22     
======================================
+ Hits        10481   10512   +31     
+ Misses        976     967    -9     

@EliaCereda EliaCereda changed the title from "Add Trainer.validate(…) method to run one validation epoch [1/n]" to "Add Trainer.validate(…) method to run one validation epoch [1/2]" Dec 2, 2020
@EliaCereda EliaCereda marked this pull request as ready for review December 2, 2020 16:02
@rohitgr7
Contributor

rohitgr7 commented Dec 2, 2020

In trainer/evaluation_loop.py and accelerators/tpu_accelerator.py, self.testing is still used in some places. Can you check?

@EliaCereda
Contributor Author

You're absolutely correct about accelerators/tpu_accelerator.py: it is still using Trainer.testing. In addition, I found other usages in plugins/sharded_plugin.py, accelerators/ddp_spawn_accelerator.py, tests/model_train_steps.py, and possibly in connectors/logger_connector.py. I must have missed these occurrences; I'll review them tomorrow.

On the other hand, it seems to me that evaluation_loop.py only accesses EvaluationLoop.testing, which is a different attribute: its purpose is to determine whether EvaluationLoop is currently in the validation or the test loop, and Trainer already sets it correctly from this line. It will be False during the validation phase of fit() and in validate(), but True during test().

Let me know if I missed any usage of Trainer.testing directly in EvaluationLoop.
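
To make the distinction concrete, here is a self-contained toy model of the two flags (illustrative names only, not the actual library code):

    from enum import Enum

    class RunningStage(Enum):
        TRAINING = "train"
        SANITY_CHECKING = "sanity_check"
        VALIDATING = "validate"
        TESTING = "test"

    def trainer_evaluating(stage: RunningStage) -> bool:
        # mirrors the new Trainer.evaluating: True for both val and test
        return stage in (RunningStage.VALIDATING, RunningStage.TESTING)

    def loop_testing(stage: RunningStage) -> bool:
        # mirrors EvaluationLoop.testing: True only for the test loop
        return stage is RunningStage.TESTING

    # validation phase of fit(), and validate():
    assert trainer_evaluating(RunningStage.VALIDATING) and not loop_testing(RunningStage.VALIDATING)
    # test():
    assert trainer_evaluating(RunningStage.TESTING) and loop_testing(RunningStage.TESTING)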

@rohitgr7
Contributor

rohitgr7 commented Dec 2, 2020

EvaluationLoop.testing doesn't seem to do anything special other than checking whether to load and call the test methods or the eval methods. I think we can easily do this with self.trainer.evaluating.

if self.testing:
    do something
else:
    do something else

can be:

if self.trainer.evaluating == 'test':
    do something
else:
    do something else

but we need to make sure it's correct and doesn't break anything else.

@EliaCereda
Contributor Author

Yes, I also think that would work. As you say, the harder part is ensuring everything works correctly, which is not trivial when the logic is intertwined across a number of files.

This is the main reason I stopped the refactoring at the attributes on Trainer.

In addition, EvaluationLoop specifically doesn't need to distinguish between fit, validate, and test, but only whether the current evaluation is over the validation set or the test set. This is essentially what replacing testing with evaluating makes possible, but it would not be of much benefit here.

And Trainer is a user-facing class, where it was important both to provide a coherent API and to avoid breaking existing users who were potentially already using testing or tested_checkpoint_path. Since EvaluationLoop is mainly an internal object, hidden from users, maybe it makes sense to leave this as a future refactor?

@Borda Borda added this to the 1.2 milestone Dec 4, 2020
@Borda Borda added the feature (Is an improvement or enhancement) and refactor labels Dec 4, 2020
@mergify mergify bot requested a review from a team December 12, 2020 14:57
@Borda Borda changed the base branch from master to release/1.2-dev December 14, 2020 17:34
@Borda
Collaborator

Borda commented Jan 26, 2021

hi @EliaCereda, how is it going here? Just waiting for review?
cc: @tchaton @awaelchli

@rohitgr7
Contributor

@EliaCereda I have cleaned up #4945 (comment) on master. It will be available once 1.1.5 gets synced to 1.2-dev: #5583

@mergify mergify bot added the has conflicts label Mar 4, 2021
@mergify mergify bot removed the has conflicts label Mar 5, 2021
Contributor

@tchaton tchaton left a comment


Awesome PR. Small comments!

@tchaton tchaton enabled auto-merge (squash) March 5, 2021 18:48
@carmocca carmocca disabled auto-merge March 6, 2021 00:52
Contributor

@awaelchli awaelchli left a comment


nice

@carmocca carmocca added the ready (PRs ready to be merged) label Mar 6, 2021
@tchaton tchaton merged commit d0596fa into Lightning-AI:master Mar 6, 2021
@tchaton tchaton mentioned this pull request Mar 9, 2021
facebook-github-bot pushed a commit to facebookresearch/d2go that referenced this pull request Apr 14, 2021
…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:
### New commit log messages
## [UnReleased] - 2021-MM-DD

### Added

- Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667))

- Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492))

- Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417))

- Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)).

- Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470))

- Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072))

- Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948))

- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915))

- Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673))

- Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277))

- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274))

- Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370))

- Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633))

- Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139))

- Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543))

- Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621))

- Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618))

- Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120))

- Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679))

- Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595))

- Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736))

- Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677))

- Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764))

### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259))

- Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621))

- Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349))

### Deprecated

- `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621))

- Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505),
    [#6530](Lightning-AI/pytorch-lightning#6530),
    [#6540](Lightning-AI/pytorch-lightning#6540),
    [#6547](Lightning-AI/pytorch-lightning#6547),
    [#6515](Lightning-AI/pytorch-lightning#6515),
    [#6572](Lightning-AI/pytorch-lightning#6572),
    [#6573](Lightning-AI/pytorch-lightning#6573),
    [#6584](Lightning-AI/pytorch-lightning#6584),
    [#6636](Lightning-AI/pytorch-lightning#6636),
    [#6637](Lightning-AI/pytorch-lightning#6637),
    [#6649](Lightning-AI/pytorch-lightning#6649),
    [#6659](Lightning-AI/pytorch-lightning#6659),
)

### Removed

- Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164))

- Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139))

- Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166))

- Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163))

- Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161))
    * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
    * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`

- Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162))

- Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167))

- Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016))

- Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207))

- Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734))

- Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093))

### Fixed

- Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802))

- Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011))

- Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070))

- Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109))

- Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436))

- Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416))

- Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506))

- Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705))

## [1.2.7] - 2021-04-06

### Fixed

- Fixed a bug with omegaconf and `xm.save` ([#6741](Lightning-AI/pytorch-lightning#6741))
- Fixed an issue with IterableDataset when `__len__` is not defined ([#6828](Lightning-AI/pytorch-lightning#6828))
- Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836))
- Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588))
- Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816))
- Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730))

## [1.2.6] - 2021-03-30

### Changed

- Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498))

### Removed

- Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682))

### Fixed

- Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398))
- Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657))
- Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719))

## [1.2.5] - 2021-03-23

### Changed

- Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576))
- Refactored setup for typing friendly ([#6590](Lightning-AI/pytorch-lightning#6590))

### Fixed

- Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587))
- Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434))
- Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275))
- Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565))

Reviewed By: shuyingsunshine21

Differential Revision: D27528929

fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f