
Error against specifying duplicate GPU device ids? #8634

@ananthsub

Description


Discussed in #8630

I have a single GPU, but I would like to spawn multiple replicas on that single GPU and train a model with DDP. Each replica would of course have to use a smaller batch size to fit in memory. (For my use case, I am not interested in a single replica with a large batch size.) I tried passing --gpus "0,0" to the Lightning Trainer, and it managed to spawn two processes on the same GPU:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2

But in the end it crashed with RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage.
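For context, the failure can be reproduced with a script along these lines. This is a hedged, self-contained sketch rather than the reporter's actual code: TinyModel and the random dataset are placeholders, and the gpus/accelerator arguments follow the public Trainer API of the 1.x releases this issue targets.

```python
# Reproduction sketch (placeholder model/data, not from the original report).
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    ds = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    # gpus="0,0" requests GPU 0 twice; two DDP ranks are launched on the same
    # device and NCCL rejects the process group with "invalid usage".
    trainer = pl.Trainer(gpus="0,0", accelerator="ddp", max_epochs=1)
    trainer.fit(TinyModel(), DataLoader(ds, batch_size=8))
```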

Given that this isn't supported by the underlying backends, we could explicitly check for duplicate device ids and raise an error in the trainer: https://github.com/PyTorchLightning/pytorch-lightning/blob/c99e2fe0d2bf713f35054eaa0d521ee7f6030786/pytorch_lightning/utilities/device_parser.py#L53-L91
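A minimal sketch of the kind of validation this proposes, assuming a small helper next to the existing parsing logic in device_parser.py (the helper name, its call site, and the error message are illustrative, not the actual implementation):

```python
from typing import List

from pytorch_lightning.utilities.exceptions import MisconfigurationException


def _check_unique(device_ids: List[int]) -> None:
    """Raise if the same device id is requested more than once.

    Duplicate ids (e.g. --gpus "0,0") make DDP launch two ranks on the same
    device, which the NCCL backend only rejects at process-group
    initialization time.
    """
    if len(device_ids) != len(set(device_ids)):
        raise MisconfigurationException(
            f"Device IDs (GPU) must be unique, but got: {device_ids}"
        )
```

Calling such a check from the GPU-id parsing path, right after the requested ids are normalized to a list, would surface the failure at Trainer construction instead of deep inside NCCL initialization.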
