
c10::Error during DistributedDataParallel training #235

@edraizen

Description


Hi, I am coming across this strange error when I train a MinkUNet using PyTorch Lightning/DistributedDataParallel on 1 node with 4 K80 GPUs. The cluster I am using also has P100 and V100 GPUs, and the model works fine on those (with 1 node).

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f1be4ca61e2 in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f1be4ef4f92 in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f1be4c949cd in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x25a (0x7f1c30d825da in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10d::Reducer::~Reducer() + 0x28a (0x7f1c30d7785a in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f1c30d57102 in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f1c30554b56 in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0xa6bdcb (0x7f1c30d57dcb in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x273fc0 (0x7f1c3055ffc0 in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x27520e (0x7f1c3056120e in /project/ppi_workspace/py37/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #18: __libc_start_main + 0xe7 (0x7f1c32b71b97 in /lib/x86_64-linux-gnu/libc.so.6)
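
The trace above surfaces in the caching allocator during `Reducer` teardown, so it may not point at the kernel that actually timed out. A common way to localize the real failure (a generic debugging sketch, not specific to MinkowskiEngine) is to force synchronous kernel launches via the documented `CUDA_LAUNCH_BLOCKING` environment variable, set before `torch` is imported:

```python
import os

# With asynchronous launches (the default), a CUDA error is often reported
# much later than the kernel that caused it - here, inside
# CUDACachingAllocator::raw_delete. Setting CUDA_LAUNCH_BLOCKING=1 makes
# every launch synchronous, so the Python stack trace points at the
# offending kernel. This must be set before torch is imported.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after setting the variable
```

This slows training down considerably, so it is only worth enabling for a short reproduction run.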

I tried again using Horovod (with PyTorch, ME, NCCL, InfiniBand, etc. installed inside a Singularity container). All GPU types succeed with 1 node, but they all fail with the same error as above when I increase the number of nodes to 2.

According to pytorch/pytorch#13541, this may be related to PyTorch being built with GLIBCXX_USE_CXX11_ABI set to 1, but I cannot seem to turn it off.
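
For reference, a minimal way to check which C++ ABI the installed wheel was built with, assuming `torch.compiled_with_cxx11_abi()` is available (it is exposed by recent PyTorch releases):

```python
def torch_uses_cxx11_abi():
    """Return True if the installed PyTorch was built with
    _GLIBCXX_USE_CXX11_ABI=1, False if with 0, or None when torch
    is not importable in this environment."""
    try:
        import torch
    except ImportError:
        return None
    return torch.compiled_with_cxx11_abi()


print("cxx11 ABI:", torch_uses_cxx11_abi())
```

The flag is baked in at build time, so changing it means rebuilding PyTorch (and any C++ extensions such as MinkowskiEngine) from source with the other setting, not flipping an environment variable at runtime.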

I also posted this to PyTorch Discuss (https://discuss.pytorch.org/t/setting-glibcxx-use-cxx11-abi-to-0/98054) since I thought it might be more relevant to PyTorch. However, I also saw a few people with the same error here (#205, #193), so I thought I'd ask here as well.

If you have any ideas about what could be causing this, I would really appreciate it.

Thanks!
