🐛 Bug
I am trying to run a PyTorch Lightning model on a 4-GPU node. In my trainer, if I specify
pl.Trainer(gpus=[0])
it runs fine. However, as soon as I request more than one GPU, e.g.
pl.Trainer(gpus=[0,1,2,3])
I get this output:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
and then training just hangs there forever. I have also tried with only 2 GPUs and see the same behavior.
Any idea why this happens? I have tried both ddp and ddp_spawn.
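For reference, a minimal sketch of the setup (the model here is a hypothetical stand-in, not my actual one, and on older Lightning versions the backend is selected with distributed_backend instead of accelerator):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Placeholder LightningModule; any module reproduces the hang for me."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(512, 32), torch.randint(0, 2, (512,)))
    train_loader = DataLoader(dataset, batch_size=64)

    # Works: trainer = pl.Trainer(gpus=[0], max_epochs=1)
    # Hangs after the "initializing ddp" lines:
    trainer = pl.Trainer(gpus=[0, 1, 2, 3], accelerator="ddp", max_epochs=1)
    trainer.fit(ToyModel(), train_loader)
```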
- PyTorch version: tried both 1.4 and 1.7
- OS: Linux
- How installed: pip
- Python version: 3.8.5
- CUDA/cuDNN version: 10.1
- GPU models and configuration: NVIDIA K80s
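To help narrow things down, a plain torch.distributed sketch (no Lightning) that exercises the same NCCL rendezvous might be useful; the master address and port below are placeholder assumptions. If this also hangs, the problem is presumably in the NCCL/GPU setup rather than in Lightning:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # One process per GPU; each joins the NCCL process group and does a single all_reduce.
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder rendezvous settings
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # default op is SUM, so t should equal world_size afterwards
    print(f"rank {rank}: all_reduce ok, value = {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```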