[Fix] Move init dist connection into the setup function #6506

SeanNaren · 2021-03-13T12:18:49Z

What does this PR do?

This fix moves the initialization of the distributed connection for DDP into the setup function, and reorders the hooks so that setup now has access to the initialized distributed environment. This is also important for FSDP.

This fix, however, makes DDP Spawn diverge from DDP, which should be noted in the docs. As @ananthsub and @awaelchli have brought up, we may need to discuss the responsibility of hooks, primarily because we're seeing some inflexibility in the API when changing the call order of hooks/custom loops.
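
The reordering described above can be illustrated with a toy model of the hook order. This is only a sketch, not Lightning's actual code: `ToyPlugin`, `ToyModule`, and `run_fit` are hypothetical names, and a boolean flag stands in for the real torch distributed process group.

```python
class ToyPlugin:
    """Hypothetical stand-in for a DDP training type plugin."""

    def __init__(self):
        self.connected = False

    def setup(self):
        # Assumption: a flag models "the distributed connection is initialized".
        self.connected = True


class ToyModule:
    """Hypothetical stand-in for a LightningModule."""

    def __init__(self):
        self.saw_connection = None

    def setup(self, plugin):
        # Record whether the distributed environment was ready inside the hook.
        self.saw_connection = plugin.connected


def run_fit(plugin, module, plugin_first):
    """Call the two setup steps in the requested order."""
    if plugin_first:
        plugin.setup()
        module.setup(plugin)
    else:
        module.setup(plugin)
        plugin.setup()
    return module.saw_connection


# Old order: the module's setup hook runs before the connection exists.
old_order = run_fit(ToyPlugin(), ToyModule(), plugin_first=False)
# New order: the connection is initialized first, so the hook can use it.
new_order = run_fit(ToyPlugin(), ToyModule(), plugin_first=True)
print(old_order, new_order)
```

With the plugin set up first, code inside the module's setup hook can rely on the connection being there, which is the behaviour this PR aims for in DDP.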

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃


awaelchli · 2021-03-13T12:45:47Z

Some tests in tests/accelerators/test_accelerator_connector.py hang on my server. I'll investigate.


awaelchli · 2021-03-17T13:45:55Z

DeepSpeed GPU tests had to be made into special tests. I was getting local errors since torch dist is being initialized, and making them independent tests seems to fix them. Not high priority now, but we could potentially wrap them in a fixture so that the torch distributed state is deleted.

We can't terminate the ddp connection after a trainer.fit call because we don't know if the user will call .fit or .test again right after it, and we want to enable that. The connection staying open could be a problem in pytest the way these tests are launched and stopped. A fixture could help us in the tests ensuring that no connections are open before running a new test, while still enabling the use case described above outside of the test environment.
One thing I would test is this: take one of these DeepSpeed tests, copy the contents to a regular script, and execute it. Does it leave any connections open after it ends?
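
The fixture idea above could be built around a small cleanup helper like the one below. This is a sketch under stated assumptions: `teardown_process_group` and `FakeDist` are hypothetical names, and the helper takes the module as a parameter so it can be shown here without torch installed. The three calls it uses (`is_available`, `is_initialized`, `destroy_process_group`) do exist on `torch.distributed`.

```python
class FakeDist:
    """Hypothetical stand-in exposing the three torch.distributed calls used below."""

    def __init__(self, initialized):
        self._initialized = initialized

    def is_available(self):
        return True

    def is_initialized(self):
        return self._initialized

    def destroy_process_group(self):
        self._initialized = False


def teardown_process_group(dist):
    """Destroy the default process group if one is open; report whether we did."""
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
        return True
    return False


# Simulate a test that left a connection open, then clean up twice.
leftover = FakeDist(initialized=True)
closed_first = teardown_process_group(leftover)   # closes the open group
closed_second = teardown_process_group(leftover)  # nothing left to close
```

In a conftest.py, an autouse pytest fixture could `yield` and then call `teardown_process_group(torch.distributed)`, so that no connection survives from one test into the next while `trainer.fit` followed by `trainer.test` keeps working outside the test suite.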

awaelchli · 2021-03-17T13:54:33Z

I have to call model_to_device twice in the single device plugin: once before and once within pre_dispatch. This is to handle the case where users define layers within setup which need to be moved to the correct devices. This does mean that if the user defines layers in setup, things can break using APEX, since weights may not be transferred to half at initialization time. I'm going to investigate further to see if I can remove the duplicate model_to_device call in the single device plugin, and fix the Apex plugin to handle this case independently.

Maybe it helps to go back to before the accelerator refactor and look again at what the order of calls was. The commit before the accelerator refactor is 309ce7a.
In the DDPAccelerator for example we can see the model is moved to the device after the setup hook is called.
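
Why the move has to come after the setup hook can be shown with a toy model. This is a sketch, not the plugin's real code: `ToyLayer`, `ToyModel`, and `devices_after` are hypothetical, and device placement is modelled with plain strings rather than real tensors.

```python
class ToyLayer:
    """Hypothetical layer that only remembers which device it lives on."""

    def __init__(self):
        self.device = "cpu"


class ToyModel:
    """Hypothetical module that creates one extra layer inside setup()."""

    def __init__(self):
        self.layers = [ToyLayer()]

    def setup(self):
        # A user defines an additional layer in the setup hook.
        self.layers.append(ToyLayer())

    def to(self, device):
        # Move every layer that exists at the time of the call.
        for layer in self.layers:
            layer.device = device
        return self


def devices_after(move_before_setup):
    """Return the device of each layer for the chosen call order."""
    model = ToyModel()
    if move_before_setup:
        model.to("cuda:0")
        model.setup()
    else:
        model.setup()
        model.to("cuda:0")
    return [layer.device for layer in model.layers]


early = devices_after(move_before_setup=True)
late = devices_after(move_before_setup=False)
print(early, late)
```

Moving first leaves the layer created in setup on the CPU, whereas moving after setup places every layer on the target device.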

SeanNaren · 2021-03-18T17:50:18Z

pytorch_lightning/accelerators/accelerator.py

+        self.setup_training_type_plugin(self.training_type_plugin, model)
        self.setup_optimizers(trainer)
-        self.connect_precision_plugin(self.precision_plugin)
+        self.setup_precision_plugin(self.precision_plugin)


If anyone extended and made their own accelerator, this will be a breaking change so might need to handle a deprecation path here

Do you need to rename them ?

I don't, I guess? It's just a bit concerning because if I don't rename them, connect_training_type_plugin will be calling plugin.setup, and connect will be calling training_type_plugin.connect. The function names are just confusing. I think if it becomes an issue we can make this backward compatible; however, in most cases it seems users should be defining plugins, not accelerators.
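
One common shape for the deprecation path mentioned in this thread is to keep the old method name as a thin shim that warns and delegates to the new one. The sketch below is an assumption for illustration only, not what the repository does: `ToyAccelerator` is hypothetical, and only the two method names come from the diff above.

```python
import warnings


class ToyAccelerator:
    """Hypothetical accelerator showing a rename with a compatibility shim."""

    def __init__(self):
        self.precision_plugin = None

    def setup_precision_plugin(self, plugin):
        # New name: does the actual work.
        self.precision_plugin = plugin

    def connect_precision_plugin(self, plugin):
        # Old name: warn callers, then delegate to the new method.
        warnings.warn(
            "connect_precision_plugin is deprecated; use setup_precision_plugin",
            DeprecationWarning,
            stacklevel=2,
        )
        self.setup_precision_plugin(plugin)


accelerator = ToyAccelerator()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The string is a placeholder for a real precision plugin object.
    accelerator.connect_precision_plugin("native_amp")
```

A shim like this only helps code that calls the old name. A subclass that overrides the old method would need separate handling.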

awaelchli · 2021-03-18T18:08:57Z

pytorch_lightning/plugins/training_type/ddp.py

-
-        # TODO: we moved it to the trainer.fit after calling pre_dispatch
-        #   ... need to double check that it is the correct place
-        # self.trainer.call_setup_hook(self.model)
-


yeah my silly todo....
"need to double check that it is the correct place"

Thanks for double checking @SeanNaren 😄

tests/special_tests.sh

tests/accelerators/test_accelerator_connector.py

tchaton

LGTM ! Great job !


ananthsub

🚀

Borda · 2021-03-18T21:38:31Z

pytorch_lightning/plugins/training_type/ddp.py

+            log.info("-" * 100)
+            log.info(f"distributed_backend={self.distributed_backend}")
+            log.info(f"All DDP processes registered. Starting ddp with {self.world_size} processes")
+            log.info("-" * 100)


btw, shall we have this as a single message instead of 4 separate ones?
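
A single-message version of the suggestion could look like the sketch below. `ddp_banner` is a hypothetical helper, and the backend name and world size are passed in as plain arguments instead of being read from the plugin.

```python
import logging

log = logging.getLogger(__name__)


def ddp_banner(distributed_backend, world_size):
    """Build the four banner lines as one multi-line string."""
    rule = "-" * 100
    return "\n".join(
        [
            rule,
            f"distributed_backend={distributed_backend}",
            f"All DDP processes registered. Starting ddp with {world_size} processes",
            rule,
        ]
    )


# One log record instead of four; "nccl" and 2 are example values.
message = ddp_banner("nccl", 2)
log.info(message)
```

Emitting one record keeps the banner together when several processes write to the same log stream.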

Borda · 2021-03-18T21:39:53Z

tests/plugins/test_deepspeed_plugin.py



-@RunIf(deepspeed=True)
+@RunIf(min_gpus=1, deepspeed=True, special=True)


this is so cool :D

* Move connection setup into the setup function. Call setup hook after we set up the accelerator
* Added CHANGELOG.md
* fix setup order in callback test
* fix input arguments in test
* Mock distributed function, remove protection to turn into training type hook
* Remove import
* Add missing mock, ensure custom plugin does not create children process
* Skip test on windows
* Update deepspeed to init connection in setup
* Do not initialize distributed module
* Move DeepSpeed tests to special tests since dist communication is being set up
* Special the test to see if this fixes CI
* Delete accelerator connector test to see if its causing build to fail
* Delete deepspeed test
* Revert "Delete accelerator connector test to see if its causing build to fail" This reverts commit edde60b
* Revert "Delete deepspeed test" This reverts commit 9d317429
* Reverse hook
* Reverse setup hooks to debug again
* Add todo so i know where i left off
* For single device move in pre_dispatch after setup function
* Add additional model to device hook if any additional parameters have been set
* See if we can enable deepspeed tests
* Revert "See if we can enable deepspeed tests" This reverts commit b5450de
* See if this hook approach works
* Introduce new granular hooks
* Remove import, fix tpu spawn by moving the function to setup
* Added missing special test

Co-authored-by: Adrian Wälchli <[email protected]>
(cherry picked from commit 4e9b453)

carmocca · 2021-03-29T20:53:19Z

Setting milestone as 1.3 as it requires separating the fix from the API change to get into 1.2.x.

…ter) to github/third-party/PyTorchLightning/pytorch-lightning Summary: ### New commit log messages ## [UnReleased] - 2021-MM-DD ### Added - Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667)) - Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492)) - Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417)) - Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)). - Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470)) - Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146)) - Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072)) - Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948)) - Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915)) - Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673)) - Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277)) - Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274)) - Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps 
outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370)) - Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633)) - Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139)) - Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543)) - Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621)) - Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349)) - Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618)) - Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120)) - Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679)) - Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595)) - Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736)) - Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677)) - Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764)) ### Changed - Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259)) - Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945)) - Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945)) - Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Changed 
profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621)) - Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349)) ### Deprecated - `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146)) - Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621)) - Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349)) - Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505), [#6530](Lightning-AI/pytorch-lightning#6530), [#6540](Lightning-AI/pytorch-lightning#6540), [#6547](Lightning-AI/pytorch-lightning#6547), [#6515](Lightning-AI/pytorch-lightning#6515), [#6572](Lightning-AI/pytorch-lightning#6572), [#6573](Lightning-AI/pytorch-lightning#6573), [#6584](Lightning-AI/pytorch-lightning#6584), [#6636](Lightning-AI/pytorch-lightning#6636), [#6637](Lightning-AI/pytorch-lightning#6637), [#6649](Lightning-AI/pytorch-lightning#6649), [#6659](Lightning-AI/pytorch-lightning#6659), ) ### Removed - Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164)) - Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139)) - Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166)) - Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163)) - Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161)) * from 
`pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve` * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce` - Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162)) - Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167)) - Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016)) - Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207)) - Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734)) - Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093)) ### Fixed - Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802)) - Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011)) - Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070)) - Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109)) - Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436)) - Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136)) - Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136)) - Fixed 
`.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416)) - Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506)) - Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705)) ## [1.2.7] - 2021-04-06 ### Fixed - Fixed resolve a bug with omegaconf and xm.save ([#6741](Lightning-AI/pytorch-lightning#6741)) - Fixed an issue with IterableDataset when __len__ is not defined ([#6828](Lightning-AI/pytorch-lightning#6828)) - Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836)) - Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588)) - Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816)) - Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730)) ## [1.2.6] - 2021-03-30 ### Changed - Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498)) ### Removed - Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. 
([#6682](Lightning-AI/pytorch-lightning#6682)) ### Fixed - Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398)) - Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654)) - Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654)) - Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657)) - Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719)) ## [1.2.5] - 2021-03-23 ### Changed - Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576)) - Refactored setup for typing friendly ([#6590](Lightning-AI/pytorch-lightning#6590)) ### Fixed - Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587)) - Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434)) - Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275)) - Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565)) Reviewed By: shuyingsunshine21 Differential Revision: D27528929 fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f

Move connection setup into the setup function. Call setup hook after …

6bf721e

…we set up the accelerator

SeanNaren added the bug (Something isn't working) and distributed (Generic distributed-related topic) labels Mar 13, 2021

SeanNaren added this to the 1.2.x milestone Mar 13, 2021

SeanNaren requested review from Borda, ananthsub, awaelchli, carmocca, justusschock, kaushikb11 and tchaton March 13, 2021 12:18

SeanNaren requested a review from williamFalcon as a code owner March 13, 2021 12:18

SeanNaren self-assigned this Mar 13, 2021

SeanNaren and others added 3 commits March 13, 2021 12:20

Added CHANGELOG.md

1576176

fix setup order in callback test

7148ee6

fix input arguments in test

4fd0c02

SeanNaren added 4 commits March 13, 2021 19:20

Mock distributed function, remove protection to turn into training ty…

cbfa681

…pe hook

Remove import

2a1dfbf

Add missing mock, ensure custom plugin does not create children process

e9c3f83

Merge branch 'master' into fix/setup_ddp_hook

2141a1f

mergify bot added the has conflicts label Mar 15, 2021

Merge branch 'master' into fix/setup_ddp_hook

96ca54f

mergify bot removed the has conflicts label Mar 15, 2021

SeanNaren added 4 commits March 15, 2021 17:48

Skip test on windows

ffe1c3f

Update deepspeed to init connection in setup

1709cdb

Do not initialize distributed module

708f97f

Move DeepSpeed tests to special tests since dist communication is bei…

ec33b96

…ng set up

SeanNaren marked this pull request as draft March 16, 2021 11:31

Merge branch 'master' into fix/setup_ddp_hook

d782554

SeanNaren added 3 commits March 18, 2021 13:04

See if this hook approach works

0210f17

Introduce new granular hooks

1bae940

Remove import, fix tpu spawn by moving the function to setup

69d6c32

SeanNaren commented Mar 18, 2021

awaelchli approved these changes Mar 18, 2021

Added missing special test

91fff3a

tchaton approved these changes Mar 18, 2021

mergify bot added the has conflicts label Mar 18, 2021

Merge branch 'master' into fix/setup_ddp_hook

88e2e09

# Conflicts: # CHANGELOG.md

mergify bot removed the has conflicts label Mar 18, 2021

SeanNaren enabled auto-merge (squash) March 18, 2021 20:44

Clean up the setup comment, since its run on train and test

3eced98

SeanNaren mentioned this pull request Mar 18, 2021

Default process group is not initialized in setup() function #6318

Closed

SeanNaren merged commit 4e9b453 into master Mar 18, 2021

ananthsub approved these changes Mar 18, 2021

SeanNaren deleted the fix/setup_ddp_hook branch March 18, 2021 21:33

SeanNaren mentioned this pull request Mar 18, 2021

Add branch condition for calling move to device in prefetch (FSDP 3/n) #6342

Closed

11 tasks

Borda reviewed Mar 18, 2021

SeanNaren mentioned this pull request Mar 18, 2021

Allow training type plugin to delay optimizer creation (FSDP 2/n) #6331

Merged

11 tasks

carmocca mentioned this pull request Mar 22, 2021

1.2.x cherries 🍒 #6083

Closed

This was referenced Mar 23, 2021

fix back-compatibility for Accel #6655

Merged

Weekly Patch Release v.1.2.5 [full merge, no squash] #6646

Merged

carmocca modified the milestones: 1.2.x, 1.3 Mar 29, 2021


