consistent behavior for reduce method across all Plugins #6011

awaelchli · 2021-02-16T15:17:21Z

What does this PR do?

Follow up to #5743, adds missing docs for the reduce method (formerly known as sync_tensor) and makes the default reduction equal to averaging. This was not consistent for all accelerators. Most of the time we need averaging, hence we make this the default.

So far, the reduce method is only sparsely used (and tested) and is not directly exposed to user. It should be safe to make this change, also considering that the whole accelerator/plugin API anyway changed and it is also labelled as experimental.

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2021-02-16T15:37:04Z

Codecov Report

Merging #6011 (14bdd1e) into master (0b27147) will decrease coverage by 0%.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #6011   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         160     160           
  Lines       11402   11402           
======================================
- Hits        10654   10625   -29     
- Misses        748     777   +29

awaelchli · 2021-02-17T16:14:15Z

pytorch_lightning/plugins/training_type/ddp.py

+            reduced value, except when the input was not a tensor the output remains is unchanged
+        """
+        if isinstance(tensor, torch.Tensor):
+            tensor = sync_ddp_if_available(tensor, group, reduce_op=(reduce_op or "mean"))


Changed this, to make the default "mean".
Let me know what you think @tchaton @justusschock

If you agree with this, I will change the PR title, otherwise I will revert back to a pure docs update.

I think it makes sense to have mean as default. What is PyTorch default ?

pytorch differentiates between reduce (only process 0 gets result) and all_reduce (all processes get the result). The default reduction is SUM.
https://pytorch.org/docs/stable/distributed.html?highlight=all_reduce#torch.distributed.all_reduce

Our internal implementation uses torch.distributed.all_reduce with SUM and for mean we divide by world_size manually.

Borda · 2021-02-17T19:38:28Z

@awaelchli seems to be unblocked :]

…ter) to github/third-party/PyTorchLightning/pytorch-lightning Summary: ### New commit log messages ## [UnReleased] - 2021-MM-DD ### Added - Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667)) - Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492)) - Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417)) - Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)). - Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470)) - Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146)) - Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072)) - Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948)) - Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915)) - Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673)) - Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277)) - Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274)) - Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370)) - Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633)) - Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139)) - Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543)) - Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621)) - Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349)) - Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618)) - Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120)) - Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679)) - Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595)) - Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736)) - Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677)) - Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764)) ### Changed - Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259)) - Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945)) - Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945)) - Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621)) - Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349)) ### Deprecated - `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146)) - Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621)) - Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349)) - Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505), [#6530](Lightning-AI/pytorch-lightning#6530), [#6540](Lightning-AI/pytorch-lightning#6540), [#6547](Lightning-AI/pytorch-lightning#6547), [#6515](Lightning-AI/pytorch-lightning#6515), [#6572](Lightning-AI/pytorch-lightning#6572), [#6573](Lightning-AI/pytorch-lightning#6573), [#6584](Lightning-AI/pytorch-lightning#6584), [#6636](Lightning-AI/pytorch-lightning#6636), [#6637](Lightning-AI/pytorch-lightning#6637), [#6649](Lightning-AI/pytorch-lightning#6649), [#6659](Lightning-AI/pytorch-lightning#6659), ) ### Removed - Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164)) - Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139)) - Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166)) - Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163)) - Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161)) * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve` * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce` - Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162)) - Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167)) - Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016)) - Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207)) - Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734)) - Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093)) ### Fixed - Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802)) - Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011)) - Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070)) - Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109)) - Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436)) - Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136)) - Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136)) - Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416)) - Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506)) - Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705)) ## [1.2.7] - 2021-04-06 ### Fixed - Fixed resolve a bug with omegaconf and xm.save ([#6741](Lightning-AI/pytorch-lightning#6741)) - Fixed an issue with IterableDataset when __len__ is not defined ([#6828](Lightning-AI/pytorch-lightning#6828)) - Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836)) - Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588)) - Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816)) - Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730)) ## [1.2.6] - 2021-03-30 ### Changed - Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498)) ### Removed - Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682)) ### Fixed - Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398)) - Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654)) - Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654)) - Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657)) - Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719)) ## [1.2.5] - 2021-03-23 ### Changed - Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576)) - Refactored setup for typing friendly ([#6590](Lightning-AI/pytorch-lightning#6590)) ### Fixed - Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587)) - Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434)) - Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275)) - Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565)) Reviewed By: shuyingsunshine21 Differential Revision: D27528929 fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f

reduction docs

00b7693

awaelchli added the docs Documentation related label Feb 16, 2021

docs for abstract base method

985aa92

awaelchli mentioned this pull request Feb 17, 2021

[HotFix] Resolve TPU Training #6027

Merged

11 tasks

make mean the default

d841d77

awaelchli changed the title ~~docs for the reduce method in Plugins (former sync_tensor)~~ [blocked by #6027] docs for the reduce method in Plugins (former sync_tensor) Feb 17, 2021

awaelchli commented Feb 17, 2021

View reviewed changes

Borda changed the title ~~[blocked by #6027] docs for the reduce method in Plugins (former sync_tensor)~~ docs for the reduce method in Plugins (former sync_tensor) Feb 17, 2021

awaelchli changed the title ~~docs for the reduce method in Plugins (former sync_tensor)~~ make reduce method in Plugins more consistent Feb 18, 2021

awaelchli added 2 commits February 18, 2021 01:09

Merge branch 'master' into docs/reduce

c067080

Merge branch 'master' into docs/reduce

934d92f

awaelchli added the design Includes a design discussion label Feb 18, 2021

awaelchli added 2 commits February 18, 2021 23:46

Merge remote-tracking branch 'origin/docs/reduce' into docs/reduce

2eb7a03

add preliminary chlog

1feda4f

awaelchli modified the milestones: 1.2.x, 1.3 Feb 19, 2021

awaelchli marked this pull request as ready for review February 19, 2021 00:41

awaelchli requested review from Borda, SeanNaren, carmocca, justusschock and williamFalcon as code owners February 19, 2021 00:41

mergify bot added the has conflicts label Feb 19, 2021

awaelchli changed the title ~~make reduce method in Plugins more consistent~~ consistent behavior for reduce method across all Plugins Feb 19, 2021

Merge branch 'master' into docs/reduce

9598c56

mergify bot removed the has conflicts label Feb 19, 2021

Borda approved these changes Feb 19, 2021

View reviewed changes

Borda enabled auto-merge (squash) February 19, 2021 07:10

Borda added the ready PRs ready to be merged label Feb 19, 2021

carmocca approved these changes Feb 19, 2021

View reviewed changes

justusschock approved these changes Feb 20, 2021

View reviewed changes

Borda merged commit 3bdc067 into master Feb 20, 2021

Borda deleted the docs/reduce branch February 20, 2021 12:30

slumericanBx mentioned this pull request Oct 27, 2023

[Snyk] Security upgrade axios from 0.25.0 to 1.6.0 slumericanBx/lightning#44

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

consistent behavior for reduce method across all Plugins #6011

consistent behavior for reduce method across all Plugins #6011

Uh oh!

awaelchli commented Feb 16, 2021 •

edited by Borda

Loading

Uh oh!

codecov bot commented Feb 16, 2021 •

edited

Loading

Uh oh!

awaelchli Feb 17, 2021

Uh oh!

tchaton Feb 17, 2021

Uh oh!

awaelchli Feb 18, 2021

Uh oh!

Borda commented Feb 17, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

consistent behavior for reduce method across all Plugins #6011

consistent behavior for reduce method across all Plugins #6011

Uh oh!

Conversation

awaelchli commented Feb 16, 2021 • edited by Borda Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

PR review

Did you have fun?

Uh oh!

codecov bot commented Feb 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

awaelchli Feb 17, 2021

Choose a reason for hiding this comment

Uh oh!

tchaton Feb 17, 2021

Choose a reason for hiding this comment

Uh oh!

awaelchli Feb 18, 2021

Choose a reason for hiding this comment

Uh oh!

Borda commented Feb 17, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

awaelchli commented Feb 16, 2021 •

edited by Borda

Loading

codecov bot commented Feb 16, 2021 •

edited

Loading