
RFC: Make subprocess DDP the default when selecting multiple devices #14075

@awaelchli


🚀 Feature

RFC: Switch to strategy="ddp" when selecting devices>1.

I'm putting this RFC out for discussion. There is no urgency, and we don't yet know what users prefer. If there is not enough interest, the issue will be closed.

Motivation

strategy="ddp_spawn" is the default strategy selected by the Trainer when setting Trainer(devices>1, accelerator="cpu"|"gpu"). DDP Spawn has two main limitations:

  1. Objects attached to the model, trainer, callbacks, etc. need to be picklable (ValueError: ctypes objects containing pointers cannot be pickled #14198).
  2. When processes join and program flow returns to the main process, most state that changed in the worker(s) is lost. Only model weights, metrics, and some attributes of the model checkpoint callback get restored manually.
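Limitation 1 can be reproduced outside of Lightning with just the standard library: spawn-based launching pickles everything it sends to worker processes, so any ctypes pointer attached to the model or trainer triggers exactly the error referenced above. A minimal sketch (the `handle` stands in for, e.g., a handle held by a C extension):

```python
import ctypes
import pickle

# A ctypes pointer, standing in for a native handle attached to the model.
handle = ctypes.pointer(ctypes.c_int(42))

# Spawn-based launching must pickle such objects to ship them to workers.
try:
    pickle.dumps(handle)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)  # ctypes objects containing pointers cannot be pickled
```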

Regular DDP based on launching subprocesses doesn't have these issues. Historically, ddp_spawn was chosen as the default because it also worked in Jupyter notebooks, but that was before PyTorch switched the default start method. We now have different solutions for running ddp in Jupyter (#13405).
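The second limitation above follows directly from how spawned processes work: each worker operates on a pickled copy of the parent's state, so mutations never propagate back. A minimal stand-in, where the dict is a hypothetical placeholder for model/trainer attributes:

```python
import multiprocessing as mp


def worker(state):
    # Runs in a freshly spawned interpreter on a pickled copy of `state`.
    state["weights"] = "updated"


def spawn_and_check():
    ctx = mp.get_context("spawn")
    state = {"weights": "initial"}
    p = ctx.Process(target=worker, args=(state,))
    p.start()
    p.join()
    # The parent's copy is untouched, which mirrors why Lightning must copy
    # weights/metrics back to the main process manually after ddp_spawn.
    return state["weights"]


if __name__ == "__main__":
    print(spawn_and_check())  # initial
```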

Pitch

Swap the default from ddp_spawn to ddp.

This would be a breaking change and should be considered for 2.0. There is also the possibility that people prefer to run ddp_spawn anyway, in which case we shouldn't make the switch.

Pros:

  • Addresses the shortcomings of ddp_spawn explained above
  • Users familiar with ddp find it more natural to switch to other subprocess-launched strategies like deepspeed, bagua, collaborative, ...

Cons:

  • If processes get stuck (a subset hangs or errors), you might end up with zombie processes and have to kill them manually. This happens less easily in ddp_spawn, because killing the main process is usually enough to kill all workers.

Alternatives

Keep it as is.

Additional context



cc @Borda @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7
