
Code stuck on "initalizing ddp" when using more than one gpu  #4612

@JosephGatto

Description


🐛 Bug

I am trying to run a PyTorch Lightning model on a 4-GPU node. In my trainer, if I specify

pl.Trainer(gpus=[0])

it runs fine. However, once I add more GPUs

pl.Trainer(gpus=[0,1,2,3])

I get this output:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4

The model then just hangs there forever. I have tried this with only 2 GPUs and see the same behavior.

Any idea why this may happen? I have tried with both ddp and ddp_spawn.
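
For context, a minimal script along the lines below is representative of what I am running (the toy model and random data are stand-ins, not my actual code):

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

# Stand-in model: the hang happens regardless of model contents.
class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

if __name__ == "__main__":
    data = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    loader = DataLoader(data, batch_size=16)
    # Completes with gpus=[0]; hangs right after the
    # "initializing ddp" lines with gpus=[0, 1, 2, 3].
    trainer = pl.Trainer(gpus=[0, 1, 2, 3], accelerator="ddp", max_epochs=1)
    trainer.fit(ToyModel(), loader)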

Environment

  • PyTorch version: tried both 1.4 and 1.7
  • OS: Linux
  • How installed: pip
  • Python version: 3.8.5
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: NVIDIA K80s
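
In case it helps isolate the problem, a bare torch.distributed check like the sketch below (the address and port are placeholders) should tell whether the hang is inside Lightning or in NCCL itself on this node:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 4  # one process per GPU

def run(rank):
    # Placeholder rendezvous settings; any free port works.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=WORLD_SIZE)
    torch.cuda.set_device(rank)
    # One all_reduce across all GPUs; if this also hangs, the
    # problem is below Lightning (NCCL / driver / interconnect).
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce ok, value = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, nprocs=WORLD_SIZE)

Running it with NCCL_DEBUG=INFO set prints NCCL's setup log, which usually shows where a stuck collective stops.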


Labels

bug · distributed · help wanted · priority: 1
