
Error against specifying duplicate GPU device ids? #8634

@ananthsub

Description


Discussed in #8630

I have a single GPU, but I would like to spawn multiple replicas on that single GPU and train a model with DDP. Each replica would of course have to use a smaller batch size to fit in memory. (For my use case, I am not interested in a single replica with a large batch size.) I tried passing --gpus "0,0" to the Lightning Trainer, and it managed to spawn two processes on the same GPU:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2

But in the end it crashed with RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage.
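For context, the failure can be reproduced with a script along these lines. This is a hedged, self-contained sketch rather than the reporter's actual code: TinyModel and the random dataset are placeholders, and the gpus/accelerator arguments follow the public Trainer API of the 1.x releases this issue targets.

```python
# Reproduction sketch (placeholder model/data, not from the original report).
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    ds = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    # gpus="0,0" requests GPU 0 twice; two DDP ranks are launched on the same
    # device and NCCL rejects the process group with "invalid usage".
    trainer = pl.Trainer(gpus="0,0", accelerator="ddp", max_epochs=1)
    trainer.fit(TinyModel(), DataLoader(ds, batch_size=8))
```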

Given that this isn't supported by the underlying backends, we could explicitly check for duplicate device ids and raise an error in the trainer: https://github.com/PyTorchLightning/pytorch-lightning/blob/c99e2fe0d2bf713f35054eaa0d521ee7f6030786/pytorch_lightning/utilities/device_parser.py#L53-L91
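A minimal sketch of the kind of validation this proposes, assuming a small helper next to the existing parsing logic in device_parser.py (the helper name, its call site, and the error message are illustrative, not the actual implementation):

```python
from typing import List

from pytorch_lightning.utilities.exceptions import MisconfigurationException


def _check_unique(device_ids: List[int]) -> None:
    """Raise if the same device id is requested more than once.

    Duplicate ids (e.g. --gpus "0,0") make DDP launch two ranks on the same
    device, which the NCCL backend only rejects at process-group
    initialization time.
    """
    if len(device_ids) != len(set(device_ids)):
        raise MisconfigurationException(
            f"Device IDs (GPU) must be unique, but got: {device_ids}"
        )
```

Calling such a check from the GPU-id parsing path, right after the requested ids are normalized to a list, would surface the failure at Trainer construction instead of deep inside NCCL initialization.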
