Closed
Labels
bug (Something isn't working), distributed (Generic distributed-related topic), environment: slurm, waiting on author (Waiting on user action, correction, or update), won't fix (This will not be worked on)
Description
🐛 Bug
I am trying to train a model across multiple nodes on a SLURM cluster, where each node has two GPUs. I therefore use the following flags in the trainer:
trainer = pl.Trainer(
    gpus=2, num_nodes=2,
    accelerator='ddp',
    max_epochs=2
)

I submit the job with sbatch run_training.sh, but I end up with the following output and nothing further happens:
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
Are there any other flags I am missing? Thanks for any help. Below you can find the contents of the files used above.
run_training.sh
#!/bin/bash
#SBATCH -o slurm_outfiles/autoencoder-%j-%A-%a.out
#SBATCH -N 2
#SBATCH -c 40
#SBATCH --gres=gpu:2
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mem 60G
srun python torch_ddp_toy.py
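Note that the script requests two nodes (-N 2) and two GPUs per node (--gres=gpu:2) but does not set --ntasks-per-node; as far as I understand, Lightning's SLURM integration expects srun to launch one task per GPU, i.e. four tasks here. A quick way to check what srun actually launches is to print the SLURM environment from each task (a minimal debugging sketch; check_slurm_env.py is a made-up name, and the variables below are standard SLURM ones):

# check_slurm_env.py -- hypothetical helper; run with: srun python check_slurm_env.py
import os

# Lightning's SLURM detection derives global rank and world size from these,
# so with 2 GPUs on each of 2 nodes one would expect SLURM_NTASKS=4.
for var in ("SLURM_NTASKS", "SLURM_NTASKS_PER_NODE",
            "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NODEID"):
    print(f"{var}={os.environ.get(var)}", flush=True)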
torch_ddp_toy.py
import pytorch_lightning as pl
import torch
from torch import nn


class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        return batch_idx

    def validation_epoch_end(self, outputs):
        print("VALIDATING", len(outputs))


if __name__ == "__main__":
    m = Module()
    datasets = [torch.rand([5]) for __ in range(100)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=8)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=1)
    trainer = pl.Trainer(
        gpus=2, num_nodes=2,
        accelerator='ddp',
        max_epochs=2
    )
    trainer.fit(m, train_loader, val_loader)

Environment
- PyTorch version 1.7.1
- PyTorch Lightning version 1.2.0
- CentOS Linux release 8.1.1911
- PyTorch installed via conda
- PyTorch Lightning via pip
- slurm 20.02.3
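As an additional data point, a Lightning-free rendezvous check can help separate launcher/SLURM problems from Lightning itself. Below is a minimal sketch (the file name torch_dist_check.py is made up; it assumes one process per srun task and uses the gloo backend so that no GPU is involved):

# torch_dist_check.py -- hypothetical; run with the same sbatch header:
#   srun python torch_dist_check.py
import os
import torch.distributed as dist

# One process per srun task; SLURM provides the rank and world size.
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])

# All ranks must agree on the rendezvous endpoint. srun exports the launch
# node's address; the port is arbitrary but must be free on that node.
os.environ.setdefault("MASTER_ADDR", os.environ["SLURM_LAUNCH_NODE_IPADDR"])
os.environ.setdefault("MASTER_PORT", "12910")

# gloo keeps the check independent of NCCL/GPUs; it only tests the rendezvous.
dist.init_process_group("gloo", rank=rank, world_size=world_size)
print(f"rank {rank}/{world_size} rendezvous OK", flush=True)
dist.destroy_process_group()

If this prints four distinct ranks with world size 4, the SLURM task layout is fine; if it prints only two ranks with world size 2, srun launched one task per node, which would match the duplicated GLOBAL_RANK lines above.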
UPDATE: added the PyTorch Lightning version.