
RuntimeError: CUDA error: invalid device ordinal in pytorch_lightning version:1.2.7 #7027

@zhhao1

Description


🐛 Bug

An error is raised when using multiple nodes and multiple GPUs on SLURM; DDP appears to initialize all processes on a single node. I only encounter this issue in version 1.2.7; everything works in version 1.1.2.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
My trainer is:
trainer = Trainer(gpus=2, accelerator='ddp', num_nodes=2)
My SLURM script is:
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
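With two tasks per node, each process should map to its own GPU on that node, yet every process in the log above reports LOCAL_RANK: 0. A minimal sketch of what the correct mapping looks like, assuming the node-local task index is taken from the SLURM_LOCALID environment variable (which SLURM sets for each srun task); the helper name slurm_local_rank is hypothetical, not part of the Lightning API:

```python
import os

def slurm_local_rank() -> int:
    """Return the node-local task index exported by SLURM as SLURM_LOCALID.

    Hypothetical helper: with --ntasks-per-node=2 and --gres=gpu:2, the
    two tasks on each node should resolve to local ranks 0 and 1, so
    each one can bind to a distinct GPU. Falls back to 0 when the
    variable is absent (e.g. a non-SLURM single-process run).
    """
    return int(os.environ.get("SLURM_LOCALID", "0"))
```

If the second task on a node sees local rank 0 instead of 1 (as in the log), both processes try to use GPU 0 and a call such as `torch.cuda.set_device(local_rank)` with a stale global rank would produce the "invalid device ordinal" error.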

  • PyTorch Lightning Version: 1.2.7
  • PyTorch Version: 1.7.1
  • OS: CentOS
  • How you installed PyTorch: pip
  • Python version: 3.7.0
  • CUDA/cuDNN version: 10.1

Labels

bug (Something isn't working) · distributed (Generic distributed-related topic) · environment: slurm · help wanted (Open to be worked on) · waiting on author (Waiting on user action, correction, or update)
