Training stuck when running on a SLURM cluster with multiple GPUs per node #6206

@DerJFK

Description

🐛 Bug

I am trying to train a model across multiple nodes on a SLURM cluster, where each node has two GPUs. I therefore use the following flags in the Trainer:

trainer = pl.Trainer(
    gpus=2, num_nodes=2,
    accelerator='ddp',
    max_epochs=2,
)

and submit the job with sbatch run_training.sh. However, I end up with the following output, and nothing happens beyond this point:

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
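
Note that GLOBAL_RANK 0 and 1 each appear twice, while ranks 2 and 3 never show up, so the 4-member rendezvous presumably blocks waiting for the missing ranks. To see what each srun task is actually given, here is a minimal sketch (assuming the standard SLURM environment variables; this helper script and its name are mine, not part of the original repro) that can be submitted the same way:

check_slurm_env.py

import os

# Print the SLURM environment each task sees. With two nodes and two GPUs
# per node, SLURM_NTASKS should be 4 and SLURM_PROCID should cover 0-3.
for var in ("SLURM_JOB_ID", "SLURM_NODEID", "SLURM_PROCID",
            "SLURM_LOCALID", "SLURM_NTASKS", "SLURM_NTASKS_PER_NODE"):
    print(var, "=", os.environ.get(var, "<unset>"))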

Are there any other flags I am missing? Thanks for any help. Below is the content of the files used above.

run_training.sh

#!/bin/bash
#SBATCH -o slurm_outfiles/autoencoder-%j-%A-%a.out
#SBATCH -N 2
#SBATCH -c 40
#SBATCH --gres=gpu:2
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mem 60G

srun python torch_ddp_toy.py
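
One possible gap (my reading of the duplicated ranks above, not something confirmed in this thread): the script never sets --ntasks-per-node, and the PyTorch Lightning docs for multi-node SLURM jobs expect the number of tasks per node to match the Trainer's gpus argument, since srun must launch one task per GPU. A sketch of the amended header:

#!/bin/bash
#SBATCH -N 2                  # num_nodes=2 in the Trainer
#SBATCH --ntasks-per-node=2   # one task per GPU, matching gpus=2
#SBATCH --gres=gpu:2

srun python torch_ddp_toy.py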

torch_ddp_toy.py

import pytorch_lightning as pl
import torch
from torch import nn

class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        # dummy loss: the sum of the linear layer's outputs
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        # return the batch index so validation_epoch_end can count batches
        return batch_idx

    def validation_epoch_end(self, outputs):
        print("VALIDATING", len(outputs))


if __name__ == "__main__":
    m = Module()

    # 100 random 5-dimensional samples, reused for training and validation
    datasets = [torch.rand([5]) for __ in range(100)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=8)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=1)

    trainer = pl.Trainer(
        gpus=2, num_nodes=2,
        accelerator='ddp',
        max_epochs=2,
    )
    trainer.fit(m, train_loader, val_loader)
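
As an additional sanity check (an assumption on my part, not part of the original repro), one could print the rank each process ends up with once the process group exists, e.g. at the top of training_step:

import torch.distributed as dist

# inside training_step: confirm all four ranks actually joined the group
if dist.is_available() and dist.is_initialized():
    print(f"rank {dist.get_rank()} of world size {dist.get_world_size()}")

In a healthy 2-node, 2-GPU run this should report ranks 0 through 3 with a world size of 4; in the hung run above, one would expect the duplicated ranks 0 and 1 instead.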
Environment

  • PyTorch version 1.7.1
  • PyTorch Lightning version 1.2.0
  • CentOS Linux release 8.1.1911
  • PyTorch installed via conda
  • PyTorch Lightning via pip
  • Slurm 20.02.3

UPDATE: added the PyTorch Lightning version.

Labels

  • bug (Something isn't working)
  • distributed (Generic distributed-related topic)
  • environment: slurm
  • waiting on author (Waiting on user action, correction, or update)
  • won't fix (This will not be worked on)
