Description
🐛 Bug
After trainer.fit(...) and trainer.test(...) finish, the program should exit normally. Instead, I get a ResourceWarning that some subprocesses are still running.
To Reproduce
Run a script that uses accelerator="ddp", launching it with PYTHONTRACEMALLOC=1 python <script_name>.py --gpus 4.
Expected behavior
The resources should be freed or the processes ended gracefully.
Environment
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration:
GPU 0: Quadro RTX 8000
GPU 1: Quadro RTX 8000
GPU 2: Quadro RTX 8000
GPU 3: Quadro RTX 8000
Nvidia driver version: 460.39
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] pytorch-lightning==1.3.0rc1
[pip3] torch==1.8.1
[pip3] torchmetrics==0.3.0rc0
[pip3] torchvision==0.9.1
[pip3] vit-pytorch==0.6.7
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.1 h6406543_8 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2020.4 h726a3e6_304 conda-forge
[conda] mkl-service 2.3.0 py38h1e0a361_2 conda-forge
[conda] mkl_fft 1.3.0 py38h5c078b8_1 conda-forge
[conda] mkl_random 1.2.0 py38hc5bc63f_1 conda-forge
[conda] numpy 1.19.2 py38h54aff64_0
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] pytorch 1.8.1 py3.8_cuda11.1_cudnn8.0.5_0 pytorch
[conda] pytorch-lightning 1.3.0rc1 pypi_0 pypi
[conda] torchmetrics 0.3.0rc0 pypi_0 pypi
[conda] torchvision 0.9.1 py38_cu111 pytorch
[conda] vit-pytorch 0.6.7 pypi_0 pypi
Additional context
The traceback:
/home/ndalton/miniconda3/envs/cvos/lib/python3.8/subprocess.py:942: ResourceWarning: subprocess 3670473 is still running
_warn("subprocess %s is still running" % self.pid,
Object allocated at (most recent call last):
File "/home/ndalton/miniconda3/envs/cvos/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", lineno 173
proc = subprocess.Popen(command, env=env_copy, cwd=cwd)
This traceback is repeated for every GPU: with gpus=4 it is shown four times, each with a different PID.
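The warning itself comes from subprocess.Popen.__del__: when a Popen object is garbage collected while its child process is still alive, CPython emits exactly this "subprocess N is still running" ResourceWarning. A self-contained sketch (no Lightning involved) that triggers the same warning:

```python
import gc
import os
import signal
import subprocess
import sys
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # A child that outlives the Popen object, like the DDP workers.
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"])
    pid = proc.pid
    del proc      # Popen.__del__ runs while the child is still alive
    gc.collect()

# The orphaned child must now be cleaned up by hand.
os.kill(pid, signal.SIGTERM)
os.waitpid(pid, 0)
```

After the `del`, `caught` contains a ResourceWarning whose message matches the one in the traceback above, which suggests the DDP plugin's Popen handles are being dropped without a wait()/terminate() on the children.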