🐛 Bug
I have now run into many distributed training runs that have frozen completely after enabling the RichProgressBar callback.
It gets stuck between epochs, right after one finishes. Sometimes it's after the first epoch, sometimes after the second, sometimes even after 9 epochs.
The strange part is that Ctrl+C doesn't interrupt the program, and SIGTERM has no effect either, so I have to kill it with SIGKILL (kill -9) and I don't get a stack trace. I'm also inside a Docker container without access to the host machine, so I can't use strace. Any help with getting a stack trace out of the hung process is welcome.
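In the meantime, one thing I can try myself is Python's standard faulthandler module, which can dump the Python-level stack traces of all threads from inside the process, so no host access is needed. This is just a sketch of what I would add near the top of the training script; the signal choice and timeout are arbitrary:

```python
import faulthandler
import signal

# Dump the Python stack traces of all threads to stderr when the process
# receives SIGUSR1, without killing it. Handy when Ctrl+C / SIGTERM are ignored.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Also dump the tracebacks periodically (here every 30 minutes) so a silent
# hang still leaves evidence in the logs; the timer can be stopped with
# faulthandler.cancel_dump_traceback_later().
faulthandler.dump_traceback_later(timeout=30 * 60, repeat=True)
```

Then, from inside the container, sending kill -USR1 to each rank's PID should print where every Python thread is blocked.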
To Reproduce
I'm sorry, but I can't share my code. What I can say is that I'm running on a machine with 8 GPUs with the ddp_find_unused_parameters_false strategy, and that the problem only appears with RichProgressBar; without it, training runs fine.
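Since I can't post my code, here is a skeleton that mirrors how my Trainer is configured. The model and dataset below are placeholders rather than my real ones, and I haven't verified that this minimal script reproduces the hang on its own:

```python
import torch
from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl
from pytorch_lightning.callbacks import RichProgressBar


class RandomDataset(Dataset):
    # Placeholder dataset; my real data pipeline is different.
    def __init__(self, size: int = 64, length: int = 256):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class PlaceholderModel(pl.LightningModule):
    # Placeholder model; my real model is much larger.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = pl.Trainer(
        gpus=8,
        strategy="ddp_find_unused_parameters_false",
        callbacks=[RichProgressBar()],  # removing this callback avoids the freeze
        max_epochs=20,
    )
    trainer.fit(PlaceholderModel(), DataLoader(RandomDataset(), batch_size=8))
```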
I'd be happy to provide more details or to try things out, so please ask.
Expected behavior
The training continues successfully without freezing between epochs.
Environment
- PyTorch Lightning Version (e.g., 1.3.0): 1.5.0
- PyTorch Version (e.g., 1.8): 1.10.0
- Python version: 3.8.12
- OS (e.g., Linux): Linux
- CUDA/cuDNN version: 11.3
- GPU models and configuration: 8x A100
- How you installed PyTorch (conda, pip, source): conda
- If compiling from source, the output of torch.__config__.show():
- Any other relevant information: