Closed
Labels
3rd party, distributed, duplicate, help wanted, working as intended
Description
🐛 Bug
I get the following error when trying to train the pl_bolts VAE with the ddp accelerator:
Traceback (most recent call last):
  File "/home/dzhang4/VAE/main.py", line 32, in <module>
    trainer.fit(model, train_dataloader=train_loader)
  File "/home/dzhang4/miniconda3/envs/coco/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/dzhang4/miniconda3/envs/coco/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 158, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/dzhang4/miniconda3/envs/coco/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 303, in ddp_train
    self.barrier('ddp_setup')
  File "/home/dzhang4/miniconda3/envs/coco/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 186, in barrier
    torch_distrib.barrier(group=self.ddp_plugin.data_parallel_group)
  File "/home/dzhang4/miniconda3/envs/coco/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370116979/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
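The message "unhandled system error" does not say which underlying call failed. A minimal sketch for getting more detail out of NCCL, assuming the variables are set before the process group is created: `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are NCCL's own environment variables, while `enable_nccl_debug` is a hypothetical helper name used only for illustration.

```python
import os


def enable_nccl_debug(subsystems="INIT,NET"):
    """Turn on verbose NCCL logging for this process and its children.

    Hypothetical helper: it only sets environment variables, so it has to
    run before torch.distributed initializes the NCCL process group.
    """
    os.environ["NCCL_DEBUG"] = "INFO"
    # Limit the log to initialization and network setup messages.
    os.environ["NCCL_DEBUG_SUBSYS"] = subsystems
    # Return every NCCL_* variable currently set, for a quick sanity check.
    return {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}


settings = enable_nccl_debug()
print(settings)
```

Calling this at the top of main.py, before the Trainer is built, should make each rank print NCCL's setup log, which may show the failing step behind the barrier error.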
To Reproduce
from pl_bolts.models.autoencoders import VAE
from pl_bolts.datamodules import ImagenetDataModule
import pytorch_lightning as pl
import torchvision.datasets as datasets
import torch
import torchvision.transforms as transforms
import os

model = VAE(224, 'resnet50', enc_out_dim=2048, latent_dim=256)
traindir = os.path.join('/home/dzhang4/Imagenet', 'train')
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]))
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
trainer = pl.Trainer(gpus=[2, 3, 4, 5, 6, 7, 8, 9], profiler=True, accelerator='ddp')
trainer.fit(model, train_dataloader=train_loader)
Expected behavior
Training should run across the selected GPUs with the ddp accelerator.
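To check whether the failure depends on which physical GPUs are selected, one option is to expose only the chosen devices through `CUDA_VISIBLE_DEVICES` and address them by their renumbered local indices. The sketch below assumes it runs before anything initializes CUDA. `expose_gpus` is a hypothetical helper, and I have not verified how this interacts with Lightning's own device handling.

```python
import os


def expose_gpus(physical_ids):
    """Make only the given physical GPUs visible to CUDA.

    Hypothetical helper: CUDA reads CUDA_VISIBLE_DEVICES when it is
    initialized, and renumbers the visible devices starting from 0.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    # Local indices that now refer to the selected physical devices.
    return list(range(len(physical_ids)))


local_ids = expose_gpus([2, 3, 4, 5, 6, 7, 8, 9])
print(os.environ["CUDA_VISIBLE_DEVICES"], local_ids)
# The Trainer could then be built with the local indices, for example:
# trainer = pl.Trainer(gpus=local_ids, profiler=True, accelerator='ddp')
```

Trying a smaller list first, such as two devices, may help separate a general NCCL setup problem from one tied to specific GPUs.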
Environment
- CUDA:
- GPU:
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- Tesla V100-SXM3-32GB
- available: True
- version: 10.1
- GPU:
- Packages:
- numpy: 1.18.1
- pyTorch_debug: False
- pyTorch_version: 1.7.1
- pytorch-lightning: 1.1.8
- tqdm: 4.46.1
- System:
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.6.10
- version: #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019