Description
🐛 Bug
When using a metric such as Accuracy from pytorch_lightning.metrics on a machine with 4 GPUs in 'dp' mode, an error occurs because the metric state is accumulated on different devices. In the case of Accuracy, the failing line is:
https://github.com/PyTorchLightning/pytorch-lightning/blob/c8ccec7a02c53ed38af6ef7193232426384eee4a/pytorch_lightning/metrics/classification/accuracy.py#L108
The arguments to torch.sum are on the same device the metric is called from, but self.correct is on a different one. The traceback is as follows:
self.accuracy_val(y_hat, y)
File "/home/***/.conda/envs/***/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/***/.conda/envs/***/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py", line 153, in forward
self.update(*args, **kwargs)
File "/home/***/.conda/envs/***/lib/python3.8/site-packages/pytorch_lightning/metrics/metric.py", line 199, in wrapped_func
return update(*args, **kwargs)
File "/home/***/.conda/envs/***/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 109, in update
self.correct += torch.sum(preds == target)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
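The mechanism can be illustrated without GPUs: in 'dp' mode the module is replicated per forward pass, but the metric's state buffer stays on the device where the metric was first placed, so a replica running on cuda:1 tries to accumulate into state living on cuda:0. A pure-Python sketch, with devices simulated as string tags since reproducing the real error needs multiple GPUs (FakeTensor and FakeAccuracy are illustrative stand-ins, not Lightning classes):

```python
class FakeTensor:
    """Minimal stand-in for a tensor that tracks its device tag."""
    def __init__(self, value, device):
        self.value = value
        self.device = device

    def __iadd__(self, other):
        # Mirrors PyTorch's cross-device check that raises
        # "Expected all tensors to be on the same device".
        if self.device != other.device:
            raise RuntimeError(
                "Expected all tensors to be on the same device, "
                f"but found at least two devices, {self.device} and {other.device}!"
            )
        self.value += other.value
        return self


class FakeAccuracy:
    """Metric whose state buffer lives on one fixed device, like self.correct."""
    def __init__(self, device="cuda:0"):
        self.correct = FakeTensor(0, device)

    def update(self, batch_correct, device):
        # In 'dp' mode each replica calls update() from its own device,
        # while self.correct stays where the metric was first placed.
        self.correct += FakeTensor(batch_correct, device)


metric = FakeAccuracy(device="cuda:0")
metric.update(3, device="cuda:0")      # replica on the metric's device: fine
try:
    metric.update(5, device="cuda:1")  # replica on another device: mismatch
except RuntimeError as e:
    print(e)
```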
Please reproduce using the BoringModel and post here
https://colab.research.google.com/drive/1zcU1ADuHZj82clrBysv-EGfgqG7SxUhN#scrollTo=V7ELesz1kVQo
To Reproduce
The shared Colab cannot actually replicate the bug, since it requires 'dp' on multiple GPUs, but it should give an idea of when the error occurs. Setting
gpus=4,
accelerator="dp",
in the Trainer and then using a metric should trigger the issue. I have tested it with Accuracy, but other users in the Slack channel reported it for other metrics such as Precision and Recall.
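Condensed, the reproduction boils down to the following Trainer configuration (a fragment, not runnable here: it needs a 4-GPU host, and BoringModelWithMetric is a placeholder name for any LightningModule that calls a pytorch_lightning.metrics metric in its training/validation step):

```python
import pytorch_lightning as pl

# BoringModelWithMetric: placeholder for a LightningModule that calls
# e.g. pytorch_lightning.metrics.Accuracy() inside its step methods.
trainer = pl.Trainer(
    gpus=4,            # one process, module replicated across 4 GPUs
    accelerator="dp",  # DataParallel mode triggers the device mismatch
)
trainer.fit(BoringModelWithMetric())
```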
Expected behavior
The devices should match when the values are added together. I am not sure what the correct approach would be; I have crudely worked around it with:
self.correct += torch.sum(preds.cuda(self.correct.device.index) == target.cuda(self.correct.device.index))
self.total += target.cuda(self.correct.device.index).numel()
in the case of Accuracy, but that is an ugly way of dealing with it.
Update: although this no longer raises the error, the accuracy is still not computed correctly, as the accumulated values get reset to 0 between steps for some reason.
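A less brittle version of that workaround is to move the incoming tensors onto the state's device with .to(), which is a no-op when the devices already match, so it stays safe in single-device runs. A minimal sketch of just the accumulation step, runnable on CPU (AccuracyState is a hypothetical stand-in, not Lightning's Metric class, and this addresses only the device mismatch, not the state-reset behaviour noted in the update above):

```python
import torch


class AccuracyState:
    """Sketch of accuracy accumulation with device-agnostic state updates."""
    def __init__(self):
        self.correct = torch.tensor(0)
        self.total = torch.tensor(0)

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # Align incoming tensors with the state's device; .to() is a
        # no-op when the devices already match.
        preds = preds.to(self.correct.device)
        target = target.to(self.correct.device)
        self.correct += torch.sum(preds == target)
        self.total += target.numel()

    def compute(self) -> torch.Tensor:
        return self.correct.float() / self.total


state = AccuracyState()
state.update(torch.tensor([1, 0, 1, 1]), torch.tensor([1, 0, 0, 1]))
state.update(torch.tensor([0, 0]), torch.tensor([0, 1]))
print(state.compute())  # 4 correct out of 6
```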
Environment
- CUDA:
- GPU:
- GeForce GTX 1080 Ti
- GeForce GTX 1080 Ti
- GeForce GTX 1080 Ti
- GeForce GTX 1080 Ti
- available: True
- version: 10.2
- Packages:
- numpy: 1.19.2
- pyTorch_debug: False
- pyTorch_version: 1.6.0
- pytorch-lightning: 1.0.3
- tqdm: 4.50.2
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor:
- python: 3.8.5
- version: #1 SMP Debian 4.19.152-1 (2020-10-18)