
DDP does not clean up the processes it makes #6994

@ndalton12

Description

🐛 Bug

After trainer.fit(...) and trainer.test(...) finish, the program should exit normally. Instead, I get a ResourceWarning that the subprocesses spawned by DDP are still running.

To Reproduce

Run a script that uses accelerator="ddp", launching it with PYTHONTRACEMALLOC=1 python <script_name>.py --gpus 4.

Expected behavior

The resources should be freed or the processes ended gracefully.
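As a sketch of what "ended gracefully" could look like (plain subprocess, not Lightning's actual teardown code — the helper name and structure are hypothetical), the launcher would keep each Popen handle it creates and terminate and reap each child before exiting, so the interpreter never finalizes a still-running subprocess:

```python
import subprocess
import sys

def cleanup(procs):
    """Hypothetical teardown: terminate each worker, then reap it."""
    for proc in procs:
        proc.terminate()            # ask the child to exit (SIGTERM)
        try:
            proc.wait(timeout=10)   # reap it so Popen.__del__ never warns
        except subprocess.TimeoutExpired:
            proc.kill()             # escalate if it ignores SIGTERM
            proc.wait()

# Stand-ins for the per-GPU workers DDP spawns:
procs = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    for _ in range(2)
]
cleanup(procs)
print(all(p.returncode is not None for p in procs))  # True: every child reaped
```

Because every child has been waited on, its returncode is set and no "subprocess ... is still running" warning can fire at interpreter shutdown.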

Environment

PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration:
GPU 0: Quadro RTX 8000
GPU 1: Quadro RTX 8000
GPU 2: Quadro RTX 8000
GPU 3: Quadro RTX 8000

Nvidia driver version: 460.39
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] pytorch-lightning==1.3.0rc1
[pip3] torch==1.8.1
[pip3] torchmetrics==0.3.0rc0
[pip3] torchvision==0.9.1
[pip3] vit-pytorch==0.6.7
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.1 h6406543_8 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2020.4 h726a3e6_304 conda-forge
[conda] mkl-service 2.3.0 py38h1e0a361_2 conda-forge
[conda] mkl_fft 1.3.0 py38h5c078b8_1 conda-forge
[conda] mkl_random 1.2.0 py38hc5bc63f_1 conda-forge
[conda] numpy 1.19.2 py38h54aff64_0
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] pytorch 1.8.1 py3.8_cuda11.1_cudnn8.0.5_0 pytorch
[conda] pytorch-lightning 1.3.0rc1 pypi_0 pypi
[conda] torchmetrics 0.3.0rc0 pypi_0 pypi
[conda] torchvision 0.9.1 py38_cu111 pytorch
[conda] vit-pytorch 0.6.7 pypi_0 pypi

Additional context

The traceback:

/home/ndalton/miniconda3/envs/cvos/lib/python3.8/subprocess.py:942: ResourceWarning: subprocess 3670473 is still running
  _warn("subprocess %s is still running" % self.pid,
Object allocated at (most recent call last):
  File "/home/ndalton/miniconda3/envs/cvos/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", lineno 173
    proc = subprocess.Popen(command, env=env_copy, cwd=cwd)

This traceback is repeated once per GPU: with gpus=4 it appears four times, each with a different PID.
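For context on where the warning comes from: Popen.__del__ emits exactly this ResourceWarning when the last reference to a Popen is dropped while its child process has never been reaped. A standalone illustration (plain subprocess, not Lightning code):

```python
import gc
import os
import signal
import subprocess
import sys
import warnings

# Spawn a child that outlives its Popen handle, like the DDP workers here.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
pid = proc.pid

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    del proc          # last reference gone while child still runs
    gc.collect()

# Popen.__del__ warned: "subprocess <pid> is still running"
leaked = any(issubclass(w.category, ResourceWarning) for w in caught)
print(leaked)         # True

os.kill(pid, signal.SIGKILL)   # clean up the orphaned child
os.waitpid(pid, 0)
```

So the fix on the Lightning side amounts to keeping the Popen handles created in ddp.py and waiting on them during teardown, rather than letting them be garbage-collected while the workers are still alive.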

Labels: bug (Something isn't working), help wanted (Open to be worked on)
