Bug description
When running on MPS-capable devices, the multinode component sets accelerator="auto", devices="auto", and strategy="ddp", which resolves to the MPSAccelerator.
Since distributed training on MPS is not supported, the component has to fall back to running on CPU on all Apple Silicon platforms instead.
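A minimal sketch of the kind of fallback the component could apply when resolving its defaults; the helper name _resolve_accelerator is hypothetical, but MPSAccelerator.is_available() is the existing Lightning check for MPS support:

# Hypothetical helper sketching the proposed fallback; not the actual
# component code.
from lightning.pytorch.accelerators import MPSAccelerator

def _resolve_accelerator(requested: str = "auto") -> str:
    """Resolve "auto" to "cpu" on Apple Silicon, where DDP over MPS fails."""
    if requested == "auto" and MPSAccelerator.is_available():
        # Distributed backends cannot run collectives on MPS tensors,
        # so force CPU instead of letting "auto" pick the MPSAccelerator.
        return "cpu"
    return requested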
How to reproduce the bug
# app.py
import lightning as L
from lightning.app.components import LightningTrainerMultiNode
from lightning.pytorch.demos.boring_classes import BoringModel


class LightningTrainerDistributed(L.LightningWork):
    @staticmethod
    def run():
        model = BoringModel()
        # Any combination of strategy, accelerator, and devices behaves the
        # same here, since these are overridden by the MultiNode component.
        trainer = L.Trainer(max_epochs=10, strategy="ddp")
        trainer.fit(model)


# 16 GPUs: 4 nodes of 4 x V100 each
component = LightningTrainerMultiNode(
    LightningTrainerDistributed,
    num_nodes=4,
    cloud_compute=L.CloudCompute("gpu-fast-multi"),  # 4 x V100
)
app = L.LightningApp(component)
Run the app above on an Apple Silicon device.
Error messages and logs
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/app/components/multi_node/pytorch_spawn.py", line 58, in dispatch_run
cls.run(local_rank, unwrap(work.run), *args, **kwargs)
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/app/components/multi_node/trainer.py", line 69, in run
work_run()
File "/Users/nohaalon/Desktop/bcn/app2.py", line 12, in run
trainer.fit(model)
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 578, in fit
call._call_and_handle_interrupt(
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 620, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1038, in _run
self.strategy.setup(self)
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 179, in setup
self.configure_ddp()
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 295, in configure_ddp
self.model = self._setup_model(LightningDistributedModule(self.model))
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 195, in _setup_model
return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/Users/nohaalon/opt/miniconda3/envs/stag-34/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: ProcessGroupGloo::allgather: unsupported device type mps
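The failure originates in the gloo process group itself, which rejects MPS tensors in collectives. A standalone sketch, assuming an Apple Silicon machine with an MPS-enabled PyTorch build and using a single-process group for simplicity, that should hit the same RuntimeError without any Lightning code:

# Standalone sketch of the root cause: gloo collectives reject MPS tensors.
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)
t = torch.ones(2, device="mps")
out = [torch.empty_like(t)]
# Expected: RuntimeError: ProcessGroupGloo::allgather: unsupported device type mps
dist.all_gather(out, t)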
Environment
No response
More info
This cannot be handled for plain PyTorch; it needs to be handled for the Lite and Trainer MultiNode components.
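A hedged sketch of what that handling could look like, shared by the Lite and Trainer MultiNode dispatch paths; the helper name and the kwargs it returns are assumptions, not the actual fix:

# Hypothetical shared fallback for the Lite and Trainer MultiNode
# dispatch; _safe_multinode_kwargs is not a real API.
from lightning.pytorch.accelerators import MPSAccelerator

def _safe_multinode_kwargs() -> dict:
    """Defaults each node's process would pass to Trainer/Lite."""
    accelerator = "cpu" if MPSAccelerator.is_available() else "auto"
    # strategy and devices stay as the component sets them today;
    # only the accelerator changes on Apple Silicon.
    return {"accelerator": accelerator, "strategy": "ddp", "devices": "auto"}

Plain PyTorch MultiNode presumably cannot be handled this way because the user code controls device placement and the process group directly.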