🐛 Bug
An error occurs when using multiple nodes with multiple GPUs on SLURM: DDP appears to always initialize on only one node. I only run into this issue with version 1.2.7; everything works correctly with version 1.1.2.
LOCAL_RANK: 0 -CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
LOCAL_RANK: 0 -CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
LOCAL_RANK: 0 -CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
LOCAL_RANK: 0 -CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
My Trainer is:
trainer = Trainer(gpus=2, accelerator='ddp', num_nodes=2)
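For context, a minimal, self-contained sketch of how such a Trainer might be driven is shown below. The ToyModel module and random dataset are hypothetical placeholders (the actual model is not shown in the report); only the Trainer arguments are taken from the issue.

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyModel(pl.LightningModule):
    # Hypothetical minimal LightningModule used only to make the sketch runnable.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

# Random placeholder data, just to have something to fit on.
train_loader = DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 2)), batch_size=8)

# Same settings as in the report: 2 GPUs per node, DDP across 2 nodes (4 processes total).
model = ToyModel()
trainer = pl.Trainer(gpus=2, accelerator='ddp', num_nodes=2)
trainer.fit(model, train_loader)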
My SLURM script is:
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
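For reference, a full submission script built around those directives might look like the sketch below. The script name train.py, the job name, and the CPU/time limits are illustrative placeholders, not taken from the report. With Lightning's SLURM integration the training script is normally launched via srun, so that SLURM spawns one task per GPU on every node and sets the rank environment variables each process reads.

#!/bin/bash
#SBATCH --job-name=ddp_test        # illustrative job name
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2        # one task per GPU on each node
#SBATCH --cpus-per-task=4          # illustrative
#SBATCH --time=01:00:00            # illustrative

# srun launches 4 tasks in total (2 nodes x 2 tasks per node); Lightning reads
# SLURM_PROCID / SLURM_LOCALID / SLURM_NODEID to assign global and local ranks.
srun python train.py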
Environment
- PyTorch Version: 1.7.1
- OS: CentOS
- How you installed PyTorch: pip
- Python version: 3.7.0
- CUDA/cuDNN version: 10.1