NCCL error using DDP #6527

@zhangdan8962

🐛 Bug

I get the following error when training the pl_bolts VAE with the DDP accelerator on 8 GPUs:

Traceback (most recent call last):
  File "/home/dzhang4/VAE/main.py", line 32, in <module>
    trainer.fit(model, train_dataloader=train_loader)
  File "/home/dzhang4/miniconda3/envs/coco/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/dzhang4/miniconda3/envs/coco/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 158, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/dzhang4/miniconda3/envs/coco/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 303, in ddp_train
    self.barrier('ddp_setup')
  File "/home/dzhang4/miniconda3/envs/coco/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 186, in barrier
    torch_distrib.barrier(group=self.ddp_plugin.data_parallel_group)
  File "/home/dzhang4/miniconda3/envs/coco/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370116979/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
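
The "unhandled system error" message does not say which underlying call failed. As a minimal sketch (not part of the original report), NCCL's own logging can be turned up before `trainer.fit` is called. It assumes the standard NCCL environment variables `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS`; the helper name `enable_nccl_debug` is hypothetical.

```python
import os


def enable_nccl_debug(env=None):
    """Request verbose NCCL logs by setting NCCL's logging variables.

    `env` defaults to os.environ; passing a plain dict is useful for testing.
    Hypothetical helper, not part of PyTorch or PyTorch Lightning.
    """
    target = os.environ if env is None else env
    target["NCCL_DEBUG"] = "INFO"         # print NCCL initialization and error details
    target["NCCL_DEBUG_SUBSYS"] = "ALL"   # include every NCCL subsystem in the log
    return target


# Call this before the Trainer is created so every DDP process inherits it.
enable_nccl_debug()
```

With these variables set, each rank should print NCCL log lines before the barrier fails, which may help narrow down the cause of the system error.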

To Reproduce

from pl_bolts.models.autoencoders import VAE
from pl_bolts.datamodules import ImagenetDataModule
import pytorch_lightning as pl
import torchvision.datasets as datasets
import torch
import torchvision.transforms as transforms
import os

model = VAE(224,'resnet50',enc_out_dim=2048,latent_dim=256)
traindir = os.path.join('/home/dzhang4/Imagenet', 'train')
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]))


train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)


trainer = pl.Trainer(gpus=[2,3,4,5,6,7,8,9],profiler=True,accelerator='ddp')
trainer.fit(model, train_dataloader=train_loader)

Expected behavior

Training runs on multiple GPUs with DDP without raising the NCCL error.

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
      • Tesla V100-SXM3-32GB
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.1
    • pyTorch_debug: False
    • pyTorch_version: 1.7.1
    • pytorch-lightning: 1.1.8
    • tqdm: 4.46.1
  • System:

Labels

  • 3rd party: Related to a 3rd-party
  • distributed: Generic distributed-related topic
  • duplicate: This issue or pull request already exists
  • help wanted: Open to be worked on
  • working as intended: Working as intended
