DDP-spawn on LSF cluster (multi-node) should not be supported #15103

Description

@awaelchli

🚀 Bug

DDP-spawn should not be supported on multi-node configurations where the cluster environment launches the processes externally (as LSF does): spawning additional processes there is redundant and currently broken.

To Reproduce

Running with

```python
Trainer(strategy="ddp_spawn", num_nodes=2, devices=...)  # any num_nodes > 1
```

on an LSF cluster currently fails, and this configuration should not be supported in the first place.
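For completeness, a minimal self-contained repro sketch, assuming a GPU allocation across two LSF nodes (`BoringModel` is Lightning's built-in demo model, used purely as a placeholder):

```python
# Minimal repro sketch; submit this script across 2 LSF nodes.
from pytorch_lightning import Trainer
from pytorch_lightning.demos.boring_classes import BoringModel

trainer = Trainer(
    strategy="ddp_spawn",
    accelerator="gpu",  # assumes a GPU allocation
    devices=2,
    num_nodes=2,  # any value > 1 hits the faulty multi-node path
)
trainer.fit(BoringModel())
```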

Expected behavior

This configuration should not be supported. There are two options:

A) Raise an error and suggest using strategy="ddp".
B) Fall back to DDP. A similar check already exists for SLURM:

https://github.com/Lightning-AI/lightning/blob/d2840a20bd27a0a86fc59ba54d94543e63523133/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L616-L619

I suggest going with B), since we already do this for the other environments; a sketch of the fallback follows below.
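A minimal sketch of what option B could look like, mirroring the linked SLURM check. The helper name and its call site are hypothetical, not a quote of the actual connector internals; `LSFEnvironment` and `rank_zero_warn` are the existing public names.

```python
from pytorch_lightning.plugins.environments import LSFEnvironment
from pytorch_lightning.utilities.rank_zero import rank_zero_warn


def _fall_back_to_ddp_if_externally_launched(strategy_flag: str, cluster_environment) -> str:
    """Hypothetical helper for option B, mirroring the existing SLURM fallback."""
    if strategy_flag == "ddp_spawn" and isinstance(cluster_environment, LSFEnvironment):
        rank_zero_warn(
            "`strategy='ddp_spawn'` is not supported on clusters that launch processes"
            " externally; falling back to `strategy='ddp'`."
        )
        return "ddp"
    return strategy_flag
```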

Additional context

Originally reported on Slack.


