Labels
bug (Something isn't working) · distributed (Generic distributed-related topic) · priority: 0 (High priority task)
Description
🐛 Bug
When I start training on 2 GPUs using pytorch-lightning 1.4.1, training crashes after a few epochs. Note that this happens only on 1.4.1:
if I run my code with pytorch-lightning 1.4.0, everything works fine.
The same crash surfaces with slightly different traces across runs. For brevity I'm attaching just one.
Here's the error trace:
Global seed set to 20
Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Files already downloaded and verified
Files already downloaded and verified
Global seed set to 20
Global seed set to 20
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Using native 16bit precision.
Global seed set to 20
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
----------------------------------------------
0 | resnet18 | ResNet | 11.2 M
1 | loss | CrossEntropyLoss | 0
----------------------------------------------
11.2 M Trainable params
0 Non-trainable params
11.2 M Total params
44.881 Total estimated model params size (MB)
Global seed set to 20
Global seed set to 20
/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:322: UserWarning: The number of training samples (44) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
rank_zero_warn(
Epoch 4: 47%|█████████████████████ | 23/49 [00:02<00:02, 9.20it/s, loss=2.51, v_num=17, val_loss=3.260, val_acc=0.239, train_loss=2.760, train_acc=0.296]terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7ff9a6d3fa22 in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10e9e (0x7ff9a6fa0e9e in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7ff9a6fa2147 in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7ff9a6d295a4 in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa2822a (0x7ffa4bb4722a in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: python() [0x4efd28]
frame #6: python() [0x5fb977]
frame #7: python() [0x5ab432]
<omitting python frames>
frame #9: python() [0x4f34b2]
frame #10: python() [0x5a6eaa]
frame #25: python() [0x50b868]
frame #30: python() [0x59be64]
frame #31: python() [0x5a6f17]
frame #42: python() [0x59c16d]
frame #43: python() [0x5a6f17]
frame #49: python() [0x5a7031]
frame #50: python() [0x69e536]
frame #52: python() [0x5c3cb0]
frame #60: python() [0x5038a2]
Traceback (most recent call last):
File "resnet18cifar.py", line 177, in <module>
trainer.fit(model)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
self._run(model)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
self._dispatch()
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
self.accelerator.start_training(self)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
self._results = trainer.run_stage()
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
return self._run_train()
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run_train
self.training_type_plugin.reconciliate_processes(traceback.format_exc())
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 453, in reconciliate_processes
raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 0
Traceback (most recent call last):
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
self.fit_loop.run()
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
epoch_output = self.epoch_loop.run(train_dataloader)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 101, in run
super().run(batch, batch_idx, dataloader_idx)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 148, in advance
result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 202, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 396, in _optimizer_step
model_ref.optimizer_step(
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1593, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 292, in optimizer_step
make_optimizer_step = self.precision_plugin.pre_optimizer_step(
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 59, in pre_optimizer_step
result = lambda_closure()
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 236, in _training_step_and_backward_closure
result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 547, in training_step_and_backward
self.backward(result, optimizer, opt_idx)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 588, in backward
result.closure_loss = self.trainer.accelerator.backward(result.closure_loss, optimizer, *args, **kwargs)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 276, in backward
self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 78, in backward
model.backward(closure_loss, optimizer, *args, **kwargs)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1465, in backward
loss.backward(*args, **kwargs)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
Variable._execution_engine.run_backward(
File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 11444) is killed by signal: Aborted.
To Reproduce
Here's my code.
It's a simple script (resnet18cifar.py) that trains a ResNet-18 on CIFAR using 2 GPUs with DDP.
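The full script isn't pasted here, so the sketch below is only a hypothetical minimal version consistent with the logs above (seed 20, native 16-bit precision, DDP on 2 GPUs, ResNet-18 with CrossEntropyLoss, 100 epochs, multi-worker DataLoaders). Batch size, number of workers, transforms, and the optimizer are placeholders, not the actual values used.

```python
# Hypothetical minimal reproduction sketch -- NOT the original resnet18cifar.py.
# Hyperparameters (batch_size, num_workers, lr) and the CIFAR-10 choice are assumptions.
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms


class ResNet18CIFAR(pl.LightningModule):
    def __init__(self, lr=0.1, batch_size=256, num_workers=4):
        super().__init__()
        self.save_hyperparameters()
        # ResNet-18 with a 10-class head, trained from scratch
        self.resnet18 = torchvision.models.resnet18(num_classes=10)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.resnet18(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = self.loss(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log("train_loss", loss, prog_bar=True, sync_dist=True)
        self.log("train_acc", acc, prog_bar=True, sync_dist=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = self.loss(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log("val_loss", loss, prog_bar=True, sync_dist=True)
        self.log("val_acc", acc, prog_bar=True, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.hparams.lr, momentum=0.9)

    def train_dataloader(self):
        tfm = transforms.ToTensor()
        ds = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=tfm)
        return DataLoader(ds, batch_size=self.hparams.batch_size, shuffle=True,
                          num_workers=self.hparams.num_workers, pin_memory=True)

    def val_dataloader(self):
        tfm = transforms.ToTensor()
        ds = torchvision.datasets.CIFAR10("data", train=False, download=True, transform=tfm)
        return DataLoader(ds, batch_size=self.hparams.batch_size,
                          num_workers=self.hparams.num_workers, pin_memory=True)


if __name__ == "__main__":
    pl.seed_everything(20)
    model = ResNet18CIFAR()
    # PL 1.4-style Trainer flags: 2 GPUs, DDP, native AMP, 100 epochs
    trainer = pl.Trainer(gpus=2, accelerator="ddp", precision=16, max_epochs=100)
    trainer.fit(model)
```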
Expected behavior
It's supposed to train for 100 epochs and finish without crashing, as it does on 1.4.0.
Environment
* CUDA:
- GPU:
- RTX A5000
- RTX A5000
- available: True
- version: 11.1
* Packages:
- numpy: 1.21.1
- pyTorch_debug: False
- pyTorch_version: 1.9.0+cu111
- pytorch-lightning: 1.4.1
- tqdm: 4.62.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.10
- version: #27~20.04.1-Ubuntu SMP Tue Jul 13 17:41:23 UTC 2021
Additional context
The error happens irrespective of whether I use DP or DDP.
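For context, the only assumed difference between the two runs is the backend flag passed to the Trainer (accelerator selection in the 1.4.x API); a minimal illustration:

```python
# Assumption: only the accelerator flag changes between the two runs.
trainer_dp = pl.Trainer(gpus=2, accelerator="dp", precision=16, max_epochs=100)    # DataParallel
trainer_ddp = pl.Trainer(gpus=2, accelerator="ddp", precision=16, max_epochs=100)  # DistributedDataParallel
```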