🐛 Bug
An error occurs when using multiple nodes with multiple GPUs on SLURM: DDP appears to always initialize on only one node. I only run into this issue with version 1.2.7; everything works correctly with version 1.1.2.
LOCAL_RANK: 0 -CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
LOCAL_RANK: 0 -CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
LOCAL_RANK: 0 -CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
LOCAL_RANK: 0 -CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
My Trainer is:
trainer = Trainer(gpus=2, accelerator='ddp', num_nodes=2)
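For context, a minimal, self-contained sketch of how such a Trainer might be driven is shown below. The ToyModel module and random dataset are hypothetical placeholders (the actual model is not shown in the report); only the Trainer arguments are taken from the issue.

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyModel(pl.LightningModule):
    # Hypothetical minimal LightningModule used only to make the sketch runnable.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

# Random placeholder data, just to have something to fit on.
train_loader = DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 2)), batch_size=8)

# Same settings as in the report: 2 GPUs per node, DDP across 2 nodes (4 processes total).
model = ToyModel()
trainer = pl.Trainer(gpus=2, accelerator='ddp', num_nodes=2)
trainer.fit(model, train_loader)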
My SLURM script is:
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
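For reference, a full submission script built around those directives might look like the sketch below. The script name train.py, the job name, and the CPU/time limits are illustrative placeholders, not taken from the report. With Lightning's SLURM integration the training script is normally launched via srun, so that SLURM spawns one task per GPU on every node and sets the rank environment variables each process reads.

#!/bin/bash
#SBATCH --job-name=ddp_test        # illustrative job name
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2        # one task per GPU on each node
#SBATCH --cpus-per-task=4          # illustrative
#SBATCH --time=01:00:00            # illustrative

# srun launches 4 tasks in total (2 nodes x 2 tasks per node); Lightning reads
# SLURM_PROCID / SLURM_LOCALID / SLURM_NODEID to assign global and local ranks.
srun python train.py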
Environment
- PyTorch Version: 1.7.1
- OS: CentOS
- How you installed PyTorch: pip
- Python version: 3.7.0
- CUDA/cuDNN version: 10.1