Description
First check
- I'm sure this is a bug.
- I've added a descriptive title to this bug.
- I've provided clear instructions on how to reproduce the bug.
- I've added a code sample.
- I've provided any other important info that is required.
Bug description
CUDAAccelerator.num_cuda_devices() returns 0 while torch.cuda.device_count() returns 1. This causes Trainer(accelerator="cuda", devices=1, ...) to fail with:
.../lib/python3.9/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
...
MisconfigurationException: CUDAAccelerator can not run on your system since the accelerator is not available. The following accelerator(s) is available and can be passed into `accelerator` argument of `Trainer`: ['cpu'].
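For reference, the mismatch can be checked directly with the short sketch below (an illustration only; it assumes the import path of the device_parser module reported under "More info" below):

import torch
from pytorch_lightning.utilities.device_parser import num_cuda_devices

# On the affected system these two calls disagree:
print(torch.cuda.device_count())  # -> 1
print(num_cuda_devices())         # -> 0, so Lightning treats CUDA as unavailable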
How to reproduce the bug
from pytorch_lightning.accelerators import CUDAAccelerator

cuda_acc = CUDAAccelerator()
cuda_acc.auto_device_count()  # returns 0
and
Trainer(accelerator="cuda", devices=1, ...)  # raises the error above
Error messages and logs
.../lib/python3.9/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
and:
...
527 if not self.accelerator.is_available():
528 available_accelerator = [
529 acc_str for acc_str in self._accelerator_types if AcceleratorRegistry.get(acc_str).is_available()
530 ]
--> 531 raise MisconfigurationException(
532 f"{self.accelerator.__class__.__qualname__} can not run on your system"
533 " since the accelerator is not available. The following accelerator(s)"
534 " is available and can be passed into `accelerator` argument of"
535 f" `Trainer`: {available_accelerator}."
536 )
538 self._set_devices_flag_if_auto_passed()
540 self._gpus = self._devices_flag if not self._gpus else self._gpus
MisconfigurationException: CUDAAccelerator can not run on your system since the accelerator is not available. The following accelerator(s) is available and can be passed into `accelerator` argument of `Trainer`: ['cpu'].
Important info
- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): Trainer, CUDAAccelerator
- PyTorch Lightning Version (e.g., 1.5.0): 1.7.7
- PyTorch Version (e.g., 1.10): 1.12.1+cu116
- Python version (e.g., 3.9): 3.9
- OS (e.g., Linux): Ubuntu 18.04
- NVIDIA driver version: 515.65.01
- CUDA version: 11.7
- cuDNN version: 8.5.0.96.1+cuda11.7
- GPU models and configuration: NVIDIA GeForce RTX 3090 (the same problem was observed on the same system with an NVIDIA GeForce GTX 1080 Ti)
- How you installed Lightning (`conda`, `pip`, source): pip
- Running environment: local
More info
I dug around and found that this function in .../lib/python3.9/site-packages/pytorch_lightning/utilities/device_parser.py (line 339):
def num_cuda_devices() -> int:
    """Returns the number of GPUs available.
    Unlike :func:`torch.cuda.device_count`, this function will do its best not to create a CUDA context for fork
    support, if the platform allows it.
    """
    if "fork" not in torch.multiprocessing.get_all_start_methods() or _is_forking_disabled():
        return torch.cuda.device_count()
    with multiprocessing.get_context("fork").Pool(1) as pool:
        return pool.apply(torch.cuda.device_count)
is the culprit. The if-statement evaluates to false, and the fork-based path apparently returns the wrong count. However, if I add a call to torch.cuda.device_count() right after the if-statement, like this:
def num_cuda_devices() -> int:
    """Returns the number of GPUs available.
    Unlike :func:`torch.cuda.device_count`, this function will do its best not to create a CUDA context for fork
    support, if the platform allows it.
    """
    if "fork" not in torch.multiprocessing.get_all_start_methods() or _is_forking_disabled():
        return torch.cuda.device_count()
    torch.cuda.device_count()  # <-- added line: query the driver in the parent before forking
    with multiprocessing.get_context("fork").Pool(1) as pool:
        return pool.apply(torch.cuda.device_count)
then everything works correctly. My guess is that when the multiprocessing code forks the main process, the child process for some reason cannot access CUDA. Please take a look at this; I found a workaround, but it is probably not what was intended when the code was written. Thank you.
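To illustrate the suspected failure mode outside Lightning, here is a minimal standalone sketch (an illustration only, mirroring what num_cuda_devices() and my workaround do; exact behaviour is system-dependent):

import multiprocessing

import torch


def count_in_fork() -> int:
    # Same approach as num_cuda_devices(): query the device count in a forked child
    # so that no CUDA context is created in the parent process.
    with multiprocessing.get_context("fork").Pool(1) as pool:
        return pool.apply(torch.cuda.device_count)


if __name__ == "__main__":
    print(count_in_fork())     # 0 on the affected system: the forked child cannot see the GPU
    torch.cuda.device_count()  # the workaround: query the driver in the parent first
    print(count_in_fork())     # 1 on the affected system after the parent-side query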