Multinode standalone not working on mps #15713

@justusschock

Bug description

On MPS-capable devices, the multi-node component sets accelerator="auto", devices="auto" and strategy="ddp", which resolves to the MPSAccelerator.
Since distributed training on MPS is not supported, the component has to run on CPU on all Apple silicon platforms instead.

How to reproduce the bug

# app.py
import lightning as L
from lightning.app.components import LightningTrainerMultiNode
from lightning.pytorch.demos.boring_classes import BoringModel


class LightningTrainerDistributed(L.LightningWork):
    @staticmethod
    def run():
        model = BoringModel()
        # The same happens with any combination of strategy, accelerator and devices,
        # because these are handled by the MultiNode component.
        trainer = L.Trainer(max_epochs=10, strategy="ddp")
        trainer.fit(model)

# 16 GPUs: (4 nodes of 4 x v100)
component = LightningTrainerMultiNode(
    LightningTrainerDistributed,
    num_nodes=4,
    cloud_compute=L.CloudCompute("gpu-fast-multi"), # 4 x v100
)
app = L.LightningApp(component)
 

Run this app on an Apple silicon device.

Error messages and logs

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/app/components/multi_node/pytorch_spawn.py", line 58, in dispatch_run
    cls.run(local_rank, unwrap(work.run), *args, **kwargs)
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/app/components/multi_node/trainer.py", line 69, in run
    work_run()
  File "/Users/nohaalon/Desktop/bcn/app2.py", line 12, in run
    trainer.fit(model)
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 578, in fit
    call._call_and_handle_interrupt(
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 620, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1038, in _run
    self.strategy.setup(self)
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 179, in setup
    self.configure_ddp()
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 295, in configure_ddp
    self.model = self._setup_model(LightningDistributedModule(self.model))
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 195, in _setup_model
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: ProcessGroupGloo::allgather: unsupported device type mps

Environment

No response

More info

This cannot be handled for plain PyTorch; it needs to be handled in the Lite and Trainer MultiNode components.
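
A minimal sketch of the CPU fallback described in the bug description, in case it helps the discussion. The helper names `is_apple_silicon` and `resolve_trainer_kwargs` are hypothetical and not part of Lightning, and the `platform` check is only an assumed way to detect Apple silicon; the real fix may well detect MPS differently.

```python
import platform
from typing import Dict, Optional


def is_apple_silicon() -> bool:
    # Assumption: a native Python on Apple silicon reports Darwin/arm64.
    # A Python running under Rosetta reports x86_64 and is not detected here.
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def resolve_trainer_kwargs(on_apple_silicon: Optional[bool] = None) -> Dict[str, str]:
    # Hypothetical helper: start from the values the multi-node component
    # sets today, then force CPU where "auto" would resolve to MPS.
    if on_apple_silicon is None:
        on_apple_silicon = is_apple_silicon()
    kwargs = {"accelerator": "auto", "devices": "auto", "strategy": "ddp"}
    if on_apple_silicon:
        # Distributed training on MPS is not supported, so fall back to CPU.
        kwargs["accelerator"] = "cpu"
    return kwargs
```

With something like this, the component could pass `**resolve_trainer_kwargs()` when it overrides the user's Trainer arguments, so the Gloo process group never sees MPS tensors.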

cc @tchaton @nohalon

Labels

app (removed) - Generic label for Lightning App package
priority: 1 - Medium priority task