Description
🚀 Bug
DDP-spawn on multi-node configurations where processes get spawned externally should not be supported.
To Reproduce
Running with
Trainer(strategy="ddp_spawn", num_nodes>1, devices=...) on LSF is currently faulty, but this configuration should not be supported in the first place.
Expected behavior
Should not be supported. Two options:
A) Raise an error and suggest using strategy="ddp" instead.
B) Fall back to DDP. A similar check already exists for SLURM.
I suggest going with B), since we already do this for the other cluster environments.
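Option B) could look roughly like the following. This is only a minimal sketch of the fallback idea, not Lightning's actual internals: the class names, the creates_processes_externally flag, and the resolve_strategy helper are all illustrative assumptions modeled on how the existing SLURM check behaves.

```python
# Hypothetical sketch of option B: if the user requests "ddp_spawn" but the
# cluster environment (e.g. LSF) already launches the worker processes
# externally, silently fall back to plain "ddp" instead of erroring out.
# All names below are illustrative, not Lightning's real API.

class ClusterEnvironment:
    """Base class; subclasses report whether the cluster manager
    creates the worker processes itself."""
    creates_processes_externally = False

class LSFEnvironment(ClusterEnvironment):
    # LSF spawns one process per task, so Lightning must not spawn again.
    creates_processes_externally = True

def resolve_strategy(strategy: str, env: ClusterEnvironment) -> str:
    """Replace a spawn-based strategy with plain DDP when the cluster
    manager launches the processes externally (mirrors the SLURM check)."""
    if strategy == "ddp_spawn" and env.creates_processes_externally:
        return "ddp"
    return strategy
```

With this shape, resolve_strategy("ddp_spawn", LSFEnvironment()) would quietly hand back "ddp", while single-node spawn usage stays untouched.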
Additional context
Originally reported on Slack.