CUDA OOM when using "ddp" mode in training  #7817

@choieq

Description

🐛 Bug

When I use DDP, I get this CUDA out-of-memory error, but training works in DP mode with the same batch size. I don't understand this situation.

Also, why does memory pile up on GPU 0? Even though I don't use GPU 0, it shows a lot of memory consumption.
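One common cause of memory piling up on GPU 0 (a guess, not verified for this setup): calling torch.load without map_location restores tensors to the device they were saved on, often cuda:0, so every DDP process allocates on GPU 0. A minimal sketch with a hypothetical checkpoint path:

import torch

# Hypothetical checkpoint path, for illustration only.
ckpt_path = "model.ckpt"

# Without map_location, torch.load restores tensors to the device they were
# saved on (often cuda:0), so every DDP process would allocate on GPU 0.
state = torch.load(ckpt_path)

# Loading onto the CPU first keeps GPU 0 free; the weights can then be moved
# to each process's own device.
state = torch.load(ckpt_path, map_location="cpu")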

Please reproduce using the BoringModel

 trainer = Trainer(fast_dev_run=False, gpus=args.gpu, max_epochs=args.epoch,
                   distributed_backend='ddp', logger=tb_logger)  # OOMs with 'ddp'; the same run works with distributed_backend='dp'
 trainer.fit(model=model, train_dataloader=train_loader, val_dataloaders=val_loader)

To Reproduce

Use the following BoringModel and post it here
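
The full BoringModel script isn't attached yet; below is a minimal, self-contained sketch in the spirit of the bug-report template (the model, dataset, and hyperparameters are placeholders, not the actual training code) that exercises the same DDP Trainer settings:

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl

class RandomDataset(Dataset):
    # Placeholder random data standing in for the real dataset.
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

class MinimalModel(pl.LightningModule):
    # Tiny stand-in for the real model.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self(batch).sum())

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

if __name__ == "__main__":
    train_loader = DataLoader(RandomDataset(32, 640), batch_size=64)
    val_loader = DataLoader(RandomDataset(32, 640), batch_size=64)
    trainer = pl.Trainer(gpus=2, max_epochs=1, distributed_backend="ddp")
    trainer.fit(MinimalModel(), train_dataloader=train_loader, val_dataloaders=val_loader)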

Environment

Note: bugs with code are solved faster! The Colab notebook should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch version: 1.6.0
  • PyTorch Lightning version: 1.2.10
  • OS: Ubuntu 18.04
  • How PyTorch was installed (conda, pip, source): conda
  • Python version: 3.x
  • CUDA/cuDNN version: 10.2

Additional context

Labels

  • bug: Something isn't working
  • distributed: Generic distributed-related topic
  • help wanted: Open to be worked on
