
CUDAAccelerator.num_cuda_devices() returns 0 while torch.cuda.device_count() returns 1 #14858

@ductai199x

Description

First check

  • I'm sure this is a bug.
  • I've added a descriptive title to this bug.
  • I've provided clear instructions on how to reproduce the bug.
  • I've added a code sample.
  • I've provided any other important info that is required.

Bug description

CUDAAccelerator.num_cuda_devices() returns 0 while torch.cuda.device_count() returns 1. This causes `Trainer(accelerator="cuda", devices=1, ...)` to raise an error:

.../lib/python3.9/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0

...

MisconfigurationException: CUDAAccelerator can not run on your system since the accelerator is not available. The following accelerator(s) is available and can be passed into `accelerator` argument of `Trainer`: ['cpu'].

How to reproduce the bug

from pytorch_lightning.accelerators import CUDAAccelerator

cuda_acc = CUDAAccelerator()
cuda_acc.auto_device_count()  # returns 0

and

Trainer(accelerator="cuda", devices=1, ...)  # raises the error above

Error messages and logs

.../lib/python3.9/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0

and:

...
    527 if not self.accelerator.is_available():
    528     available_accelerator = [
    529         acc_str for acc_str in self._accelerator_types if AcceleratorRegistry.get(acc_str).is_available()
    530     ]
--> 531     raise MisconfigurationException(
    532         f"{self.accelerator.__class__.__qualname__} can not run on your system"
    533         " since the accelerator is not available. The following accelerator(s)"
    534         " is available and can be passed into `accelerator` argument of"
    535         f" `Trainer`: {available_accelerator}."
    536     )
    538 self._set_devices_flag_if_auto_passed()
    540 self._gpus = self._devices_flag if not self._gpus else self._gpus

MisconfigurationException: CUDAAccelerator can not run on your system since the accelerator is not available. The following accelerator(s) is available and can be passed into `accelerator` argument of `Trainer`: ['cpu'].

Important info


- Lightning Component: Trainer, CUDAAccelerator
- PyTorch Lightning Version: 1.7.7
- PyTorch Version: 1.12.1+cu116
- Python version: 3.9
- OS: Ubuntu 18.04
- NVIDIA driver version: 515.65.01
- CUDA version: 11.7
- cuDNN version: 8.5.0.96.1+cuda11.7
- GPU models and configuration: NVIDIA GeForce RTX 3090 (similar problem observed on the same system with an NVIDIA GeForce GTX 1080 Ti)
- How you installed Lightning (`conda`, `pip`, source): pip
- Running environment: local

More info

I dug around, and what I found is that this function in .../lib/python3.9/site-packages/pytorch_lightning/utilities/device_parser.py (line 339):

def num_cuda_devices() -> int:
    """Returns the number of GPUs available.

    Unlike :func:`torch.cuda.device_count`, this function will do its best not to create a CUDA context for fork
    support, if the platform allows it.
    """
    if "fork" not in torch.multiprocessing.get_all_start_methods() or _is_forking_disabled():
        return torch.cuda.device_count()
    with multiprocessing.get_context("fork").Pool(1) as pool:
        return pool.apply(torch.cuda.device_count)

is the culprit. The if-statement evaluates to false, and the multiprocessing path apparently returns the wrong count. However, if I add a torch.cuda.device_count() call after the if-statement, like this:

def num_cuda_devices() -> int:
    """Returns the number of GPUs available.

    Unlike :func:`torch.cuda.device_count`, this function will do its best not to create a CUDA context for fork
    support, if the platform allows it.
    """
    if "fork" not in torch.multiprocessing.get_all_start_methods() or _is_forking_disabled():
        return torch.cuda.device_count()
    torch.cuda.device_count()
    with multiprocessing.get_context("fork").Pool(1) as pool:
        return pool.apply(torch.cuda.device_count)

then everything works correctly. I think the problem is that when multiprocessing forks the main process, the resulting child process for some reason does not have access to CUDA. Please take a look at this problem: while I found a workaround, it may not be what was intended when the code was written. Thank you.
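The fork-dispatch pattern involved can be illustrated without CUDA at all. Below is a minimal, stdlib-only sketch with the same shape as `num_cuda_devices()`; `query_device_count` and `fake_device_count` are hypothetical names for illustration, with `fake_device_count` standing in for `torch.cuda.device_count`:

```python
import multiprocessing


def fake_device_count() -> int:
    # Hypothetical stand-in for torch.cuda.device_count (illustration only).
    return 1


def query_device_count(fn) -> int:
    # Same structure as pytorch_lightning's num_cuda_devices(): if the
    # platform supports "fork", run the query in a forked child so the
    # parent process never initializes a CUDA context; otherwise, fall
    # back to calling it directly in the current process.
    if "fork" not in multiprocessing.get_all_start_methods():
        return fn()
    with multiprocessing.get_context("fork").Pool(1) as pool:
        return pool.apply(fn)


if __name__ == "__main__":
    # In the bug above, the real torch.cuda.device_count returned 0 from
    # the forked child but 1 from a direct call in the parent; invoking it
    # once in the parent before forking changed the child's answer too.
    print(query_device_count(fake_device_count))
```

With the stdlib stand-in the forked child inherits everything it needs and returns the same value as the parent; the reported bug is specific to CUDA driver state not surviving into (or initializing inside) the forked child.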

Labels

accelerator: cuda, bug