PyTorch Lightning 1.4.1 crashes during training #8821

@rishikanthc

Description

🐛 Bug

When I start training on 2 GPUs using pytorch-lightning 1.4.1, the training crashes after a few epochs. Note that this happens only on 1.4.1.
If I run my code using pytorch-lightning 1.4.0, everything works fine.
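As a temporary workaround I just pin the older release, e.g.:

pip install pytorch-lightning==1.4.0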

The exact error message varies slightly between runs. For brevity I'm attaching just one trace.
Here's the error trace:

Global seed set to 20
Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Files already downloaded and verified
Files already downloaded and verified
Global seed set to 20
Global seed set to 20
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Using native 16bit precision.
Global seed set to 20
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name     | Type             | Params
----------------------------------------------
0 | resnet18 | ResNet           | 11.2 M
1 | loss     | CrossEntropyLoss | 0
----------------------------------------------
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
44.881    Total estimated model params size (MB)
Global seed set to 20
Global seed set to 20
/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:322: UserWarning: The number of training samples (44) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 4:  47%|█████████████████████                        | 23/49 [00:02<00:02,  9.20it/s, loss=2.51, v_num=17, val_loss=3.260, val_acc=0.239, train_loss=2.760, train_acc=0.296]terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7ff9a6d3fa22 in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10e9e (0x7ff9a6fa0e9e in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7ff9a6fa2147 in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7ff9a6d295a4 in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa2822a (0x7ffa4bb4722a in /home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: python() [0x4efd28]
frame #6: python() [0x5fb977]
frame #7: python() [0x5ab432]
<omitting python frames>
frame #9: python() [0x4f34b2]
frame #10: python() [0x5a6eaa]
frame #25: python() [0x50b868]
frame #30: python() [0x59be64]
frame #31: python() [0x5a6f17]
frame #42: python() [0x59c16d]
frame #43: python() [0x5a6f17]
frame #49: python() [0x5a7031]
frame #50: python() [0x69e536]
frame #52: python() [0x5c3cb0]
frame #60: python() [0x5038a2]

Traceback (most recent call last):
  File "resnet18cifar.py", line 177, in <module>
    trainer.fit(model)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run_train
    self.training_type_plugin.reconciliate_processes(traceback.format_exc())
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 453, in reconciliate_processes
    raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 0
 Traceback (most recent call last):
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
    batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 101, in run
    super().run(batch, batch_idx, dataloader_idx)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 148, in advance
    result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 202, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 396, in _optimizer_step
    model_ref.optimizer_step(
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1593, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 292, in optimizer_step
    make_optimizer_step = self.precision_plugin.pre_optimizer_step(
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 59, in pre_optimizer_step
    result = lambda_closure()
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 236, in _training_step_and_backward_closure
    result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 547, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 588, in backward
    result.closure_loss = self.trainer.accelerator.backward(result.closure_loss, optimizer, *args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 276, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 78, in backward
    model.backward(closure_loss, optimizer, *args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1465, in backward
    loss.backward(*args, **kwargs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/home/bitwiz/codeden/pruning/env/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 11444) is killed by signal: Aborted.

To Reproduce

Here's my code.
It's a simple script that trains resnet18 on CIFAR-10 using 2 GPUs with DDP.
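Since the attachment isn't inlined here, this is a minimal sketch of what the script does (batch size, num_workers, optimizer and transform choices are illustrative, not necessarily the exact values in my script):

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader
import pytorch_lightning as pl


class LitResnet18(pl.LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()
        # plain torchvision resnet18 with 10 output classes for CIFAR-10
        self.resnet18 = torchvision.models.resnet18(num_classes=10)
        self.loss = nn.CrossEntropyLoss()
        self.lr = lr

    def forward(self, x):
        return self.resnet18(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss(self(x), y)
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss(self(x), y)
        self.log("val_loss", loss, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr, momentum=0.9)

    def train_dataloader(self):
        ds = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=T.ToTensor())
        return DataLoader(ds, batch_size=1024, shuffle=True, num_workers=4)

    def val_dataloader(self):
        ds = torchvision.datasets.CIFAR10("data", train=False, download=True, transform=T.ToTensor())
        return DataLoader(ds, batch_size=1024, num_workers=4)


if __name__ == "__main__":
    pl.seed_everything(20)
    model = LitResnet18()
    trainer = pl.Trainer(
        gpus=2,
        accelerator="ddp",   # DistributedDataParallel across the 2 GPUs
        precision=16,        # native AMP, matches "Using native 16bit precision."
        max_epochs=100,
    )
    trainer.fit(model)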

Expected behavior

It's supposed to train for 100 epochs and complete without crashing.

Environment

* CUDA:
	- GPU:
		- RTX A5000
		- RTX A5000
	- available:         True
	- version:           11.1
* Packages:
	- numpy:             1.21.1
	- pyTorch_debug:     False
	- pyTorch_version:   1.9.0+cu111
	- pytorch-lightning: 1.4.1
	- tqdm:              4.62.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.8.10
	- version:           #27~20.04.1-Ubuntu SMP Tue Jul 13 17:41:23 UTC 2021

Additional context

The error happens irrespective of whether I use DP or DDP.
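Both of these configurations crash for me (flags as I remember them; they may differ slightly from the exact script):

trainer = pl.Trainer(gpus=2, accelerator="dp", precision=16, max_epochs=100)   # DataParallel
trainer = pl.Trainer(gpus=2, accelerator="ddp", precision=16, max_epochs=100)  # DistributedDataParallel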

Labels

bug (Something isn't working) · distributed (Generic distributed-related topic) · priority: 0 (High priority task)
