🚀 Feature
Lightning should offer a central place to use the collective functions provided here: https://pytorch.org/docs/stable/distributed.html#collective-functions
Motivation
LightningModule code is usually agnostic to which device it's running on or whether it's running in a distributed training environment. However, there are times where the module does need to rely on collective functions.
In Lightning, we currently have many places where these are offered:
- On this distributed object, which only supports `broadcast`: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/distributed/dist.py
- `reduce`, `barrier`, `broadcast`, `all_gather`, and `reduce_boolean_decision` are on the trainer's accelerator and training type plugin:
  - https://github.com/PyTorchLightning/pytorch-lightning/blob/233f252bb427c930be8e7ca56fe115b637278b8d/pytorch_lightning/accelerators/accelerator.py#L431-L455
  - https://github.com/PyTorchLightning/pytorch-lightning/blob/233f252bb427c930be8e7ca56fe115b637278b8d/pytorch_lightning/plugins/training_type/training_type_plugin.py#L78-L103
- More utilities for gathering tensors, `all_gather`, and `sync_ddp` here: https://github.com/PyTorchLightning/pytorch-lightning/blob/b9a52fa2ef31f12f6992ece18a033318ec551907/pytorch_lightning/utilities/distributed.py#L86-L217
- `all_gather` is repeated again on the LightningModule, calling the trainer's accelerator functions: https://github.com/PyTorchLightning/pytorch-lightning/blob/233f252bb427c930be8e7ca56fe115b637278b8d/pytorch_lightning/core/lightning.py#L506-L532
Some of these call each other, and the dependencies between them aren't clear, so it is confusing for users to know which one they should use.
Pitch
- Offer these utilities in a central place, `pytorch_lightning/utilities/collectives.py`: `barrier`, `all_gather`, `broadcast`, etc.
  These should be very thin wrappers over the PyTorch distributed functions, checking whether torch.distributed is available and initialized. If not, they return what's expected for single-process training (a minimal sketch follows this list).
- Update the internal call sites to use these implementations.
- Mark the existing functions as deprecated, slated for removal in v1.6.
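
Below is a minimal sketch of what such thin wrappers could look like. The module path `pytorch_lightning/utilities/collectives.py` and the exact function signatures are assumptions for illustration, not a final API:

```python
# Hypothetical pytorch_lightning/utilities/collectives.py
# Thin wrappers over torch.distributed that fall back to single-process
# behaviour when the process group is unavailable or uninitialized.
from typing import List

import torch
import torch.distributed as dist


def distributed_available() -> bool:
    """Return True only if torch.distributed can actually be used in this process."""
    return dist.is_available() and dist.is_initialized()


def barrier() -> None:
    """Synchronize all processes; a no-op in single-process training."""
    if distributed_available():
        dist.barrier()


def broadcast(tensor: torch.Tensor, src: int = 0) -> torch.Tensor:
    """Broadcast ``tensor`` from rank ``src``; identity when not distributed."""
    if distributed_available():
        dist.broadcast(tensor, src=src)
    return tensor


def all_gather(tensor: torch.Tensor) -> List[torch.Tensor]:
    """Gather ``tensor`` from all processes; a single-element list when not distributed."""
    if not distributed_available():
        return [tensor]
    gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)
    return gathered
```

With a single entry point like this, LightningModule code can call the same functions whether it runs single-process or distributed, and the existing accelerator/plugin methods can simply delegate to them. For the deprecation step, the old entry points could become shims along these lines (names here are purely illustrative):

```python
import warnings


def legacy_all_gather(tensor):
    # Hypothetical shim kept for backward compatibility until removal in v1.6.
    warnings.warn(
        "This function is deprecated and will be removed in v1.6; use"
        " pytorch_lightning.utilities.collectives.all_gather instead.",
        DeprecationWarning,
    )
    return all_gather(tensor)
```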