
RichProgressBar deadlocks distributed training #10362

@bryant1410

Description

🐛 Bug

I have now run into many distributed training runs that have gone completely frozen after enabling the RichProgressBar callback.

It gets stuck between epochs, after one has finished. Sometimes it's after the first epoch, sometimes after the second, sometimes even after 9 epochs.

The weird thing is that Ctrl+C doesn't interrupt the program. I have to kill it with SIGKILL (kill -9) because SIGTERM doesn't work either, so I can't get a stack trace that way. I'm also inside a Docker container, so I can't run strace (I don't have access to the host machine). Any help with getting a stack trace out of the hung process is welcome.
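
One thing I'm planning to try (a sketch, not something I've verified in this container) is registering `faulthandler` before `trainer.fit()` so that sending a signal to the hung rank dumps every thread's Python stack; attaching `py-spy dump` from inside the container might also work if it has `SYS_PTRACE`.

```python
# Sketch: dump the Python stacks of all threads on SIGUSR1, so a hung process
# can be inspected with `kill -USR1 <pid>` without terminating it.
# (Assumes the signal reaches the hung rank; not yet verified in my setup.)
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)

# ... build the model / datamodule / trainer as usual, then:
# trainer.fit(model)
```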

To Reproduce

I'm sorry, but I can't share my code. All I can say is that I use a machine with 8 GPUs and the ddp_find_unused_parameters_false strategy, and that the problem only appears with RichProgressBar enabled; without it, training runs fine. A rough sketch of the setup is below.

I'd be happy to provide more details or try things out, just ask! But I can't share the actual code, sorry.
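
A minimal sketch approximating my configuration (the model and datamodule names are placeholders, since the real code is what I can't share):

```python
# Approximate setup; MyModel / MyDataModule stand in for the real code.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import RichProgressBar

trainer = pl.Trainer(
    gpus=8,
    strategy="ddp_find_unused_parameters_false",
    callbacks=[RichProgressBar()],  # removing this callback avoids the hang
    max_epochs=10,                  # placeholder; the hang can occur after any epoch
)
# trainer.fit(MyModel(), datamodule=MyDataModule())
```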

Expected behavior

The training to continue successfully.

Environment

  • PyTorch Lightning Version (e.g., 1.3.0): 1.5.0
  • PyTorch Version (e.g., 1.8): 1.10.0
  • Python version: 3.8.12
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: 8x A100
  • How you installed PyTorch (conda, pip, source): conda
