
Code freezes before validation sanity check when using DDP #7336

@notprime

🐛 Bug

Greetings from Italy!
I recently moved to PyTorch, and a friend of mine introduced me to PyTorch Lightning (PL).
I'm coding an autoencoder (whose architecture is still pretty simple) with a custom loss function
that operates on the hidden-layer output. The link below leads to the GitHub repo:

https://github.com/notprime/custom_autoencoder/blob/main/autoenc_torch.ipynb

I read the documentation on multi-GPU training, so I set accelerator='ddp'
and gpus=-1 to use all available GPUs.
However, when I launch the script, it freezes right after printing:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

I waited 10-15 minutes, but nothing happened.
If I use 'dp' as the accelerator instead, everything works fine and the script doesn't freeze.
The documentation says that ddp is preferred over dp because it's faster:
is there something I did wrong? I really don't know why the code gets stuck when I use ddp!
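
For reference, the two setups being compared can be sketched roughly as follows. This is a minimal sketch, not the exact notebook code: `trainer_kwargs` and `build_trainer` are hypothetical helper names, the argument names assume the PyTorch Lightning 1.x `Trainer` API (later releases may differ), and `precision=16` is inferred from the "Using native 16bit precision" line in the log above.

```python
# Sketch of the Trainer settings described in this report.
# Assumption: PyTorch Lightning 1.x argument names (accelerator, gpus, precision).

def trainer_kwargs(accelerator: str) -> dict:
    """Return the keyword arguments for the 'ddp' (freezes) or 'dp' (works) run."""
    if accelerator not in ("ddp", "dp"):
        raise ValueError(f"expected 'ddp' or 'dp', got {accelerator!r}")
    return {
        "accelerator": accelerator,  # 'ddp' hangs for me, 'dp' runs fine
        "gpus": -1,                  # -1 selects every visible GPU
        "precision": 16,             # inferred from the log, not confirmed
    }


def build_trainer(accelerator: str):
    """Create the Trainer; Lightning is imported lazily so the helper above
    can be inspected without it installed."""
    import pytorch_lightning as pl

    return pl.Trainer(**trainer_kwargs(accelerator))


# Intended use (model and dataloaders come from the linked notebook):
#   trainer = build_trainer("ddp")   # freezes after the LOCAL_RANK line
#   trainer = build_trainer("dp")    # trains normally
#   trainer.fit(model, train_loader, val_loader)
```

The only difference between the two runs is the `accelerator` value; everything else is kept identical.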

Thanks in advance!

  • PyTorch Version: 1.8.1
  • OS: Ubuntu 18.04
  • How you installed PyTorch: 'conda'
  • Python version: 3.8
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: 4 x TITAN Xp 12GB
