
RFC: Make subprocess DDP the default when selecting multiple devices #14075

@awaelchli


🚀 Feature

RFC: Switch to strategy="ddp" when selecting devices>1.

I'm putting this RFC out for discussion. There is no urgency, and we don't yet know what users prefer. If there is not enough interest, the issue will be closed.

Motivation

strategy="ddp_spawn" is the default strategy selected by the Trainer when setting Trainer(devices>1, accelerator="cpu"|"gpu"). DDP Spawn has two main limitations:

  1. Objects attached to the model, trainer, callbacks, etc. need to be picklable (ValueError: ctypes objects containing pointers cannot be pickled #14198).
  2. When processes join and program flow returns to the main process, most state that changed in the worker(s) is lost. Only model weights, metrics, and some attributes of the model checkpoint callback get restored manually.
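Limitation 1 can be reproduced outside of Lightning with just the standard library: spawn-based launching pickles everything it sends to worker processes, so any ctypes pointer attached to the model or trainer triggers exactly the error referenced above. A minimal sketch (the `handle` stands in for, e.g., a handle held by a C extension):

```python
import ctypes
import pickle

# A ctypes pointer, standing in for a native handle attached to the model.
handle = ctypes.pointer(ctypes.c_int(42))

# Spawn-based launching must pickle such objects to ship them to workers.
try:
    pickle.dumps(handle)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)  # ctypes objects containing pointers cannot be pickled
```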

Regular DDP based on launching subprocesses doesn't have these issues. Historically, ddp_spawn was chosen as the default because it also worked in Jupyter notebooks, but that was before PyTorch switched the default start method. We now have different solutions for running ddp in Jupyter (#13405).
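The second limitation above follows directly from how spawned processes work: each worker operates on a pickled copy of the parent's state, so mutations never propagate back. A minimal stand-in, where the dict is a hypothetical placeholder for model/trainer attributes:

```python
import multiprocessing as mp


def worker(state):
    # Runs in a freshly spawned interpreter on a pickled copy of `state`.
    state["weights"] = "updated"


def spawn_and_check():
    ctx = mp.get_context("spawn")
    state = {"weights": "initial"}
    p = ctx.Process(target=worker, args=(state,))
    p.start()
    p.join()
    # The parent's copy is untouched, which mirrors why Lightning must copy
    # weights/metrics back to the main process manually after ddp_spawn.
    return state["weights"]


if __name__ == "__main__":
    print(spawn_and_check())  # initial
```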

Pitch

Swap the default from ddp_spawn to ddp.

This would be a breaking change and should be considered for 2.0. There is also the possibility that people prefer to run ddp_spawn anyway, in which case we shouldn't make the switch.

Pros:

  • Addresses the shortcomings of ddp_spawn explained above
  • Users familiar with ddp find it more natural to switch to other subprocess-launched strategies like deepspeed, bagua, collaborative, ...

Cons:

  • If processes get stuck (a subset hangs or errors), you might end up with zombie processes and have to kill them manually. This happens less easily in ddp_spawn, because killing the main process is usually enough to kill all workers.

Alternatives

Keep it as is.

Additional context



cc @Borda @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7
