
c10::Error during DistributedDataParallel training #235

@edraizen

Description


Hi, I am coming across this strange error when I train a MinkUNet using PyTorch Lightning/DistributedDataParallel on 1 node with 4 K80 GPUs. The cluster I am using also has P100 and V100 GPUs, and the model works fine on those (with 1 node).

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f1be4ca61e2 in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f1be4ef4f92 in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f1be4c949cd in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x25a (0x7f1c30d825da in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10d::Reducer::~Reducer() + 0x28a (0x7f1c30d7785a in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f1c30d57102 in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f1c30554b56 in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0xa6bdcb (0x7f1c30d57dcb in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x273fc0 (0x7f1c3055ffc0 in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x27520e (0x7f1c3056120e in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #18: __libc_start_main + 0xe7 (0x7f1c32b71b97 in /lib/x86_64-linux-gnu/libc.so.6)
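
The trace above surfaces in the caching allocator during `Reducer` teardown, so it may not point at the kernel that actually timed out. A common way to localize the real failure (a generic debugging sketch, not specific to MinkowskiEngine) is to force synchronous kernel launches via the documented `CUDA_LAUNCH_BLOCKING` environment variable, set before `torch` is imported:

```python
import os

# With asynchronous launches (the default), a CUDA error is often reported
# much later than the kernel that caused it - here, inside
# CUDACachingAllocator::raw_delete. Setting CUDA_LAUNCH_BLOCKING=1 makes
# every launch synchronous, so the Python stack trace points at the
# offending kernel. This must be set before torch is imported.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after setting the variable
```

This slows training down considerably, so it is only worth enabling for a short reproduction run.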

I tried again using Horovod (with PyTorch, ME, NCCL, InfiniBand, etc. installed inside a Singularity container). All GPU types succeed with 1 node, but they all fail with the same error as above when I increase the number of nodes to 2.

According to pytorch/pytorch#13541, this may be related to PyTorch being built with GLIBCXX_USE_CXX11_ABI set to 1, but I cannot seem to turn it off.
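
For reference, a minimal way to check which C++ ABI the installed wheel was built with, assuming `torch.compiled_with_cxx11_abi()` is available (it is exposed by recent PyTorch releases):

```python
def torch_uses_cxx11_abi():
    """Return True if the installed PyTorch was built with
    _GLIBCXX_USE_CXX11_ABI=1, False if with 0, or None when torch
    is not importable in this environment."""
    try:
        import torch
    except ImportError:
        return None
    return torch.compiled_with_cxx11_abi()


print("cxx11 ABI:", torch_uses_cxx11_abi())
```

The flag is baked in at build time, so changing it means rebuilding PyTorch (and any C++ extensions such as MinkowskiEngine) from source with the other setting, not flipping an environment variable at runtime.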

I also posted this to PyTorch Discuss (https://discuss.pytorch.org/t/setting-glibcxx-use-cxx11-abi-to-0/98054) since I thought it might be more relevant to PyTorch. However, I also saw a few people with the same error here (#205, #193), so I thought I'd ask here as well.

If you have any ideas about what could be causing this, I would really appreciate it.

Thanks!
