🐛 Bug
Adagrad does not work with GPUs and DDP because the optimizer is created before the model is moved to CUDA, so the per-parameter state that Adagrad allocates at construction time stays on the CPU. I believe this issue has been addressed in an earlier version: #554
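For reference, here is a minimal plain-PyTorch sketch (my own illustration, not code from the Colab) of what I believe is happening: Adagrad allocates its 'sum' state buffers in its constructor, so creating the optimizer before moving the model to CUDA leaves that state on the CPU while the parameters and gradients end up on the GPU.

import torch

model = torch.nn.Linear(32, 2)                       # parameters start on the CPU
optimizer = torch.optim.Adagrad(model.parameters())  # Adagrad allocates its 'sum' state here, on the CPU
model.cuda()                                         # parameters are moved in place to cuda:0

loss = model(torch.randn(4, 32, device="cuda")).sum()
loss.backward()                                      # gradients live on cuda:0
optimizer.step()                                     # fails: state 'sum' is on cpu, grad is on cuda:0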
How to reproduce using the BoringModel
https://colab.research.google.com/drive/1HfyL5htoOkPETggTLwYNfh94HrNc6TOS?usp=sharing
The error occurs when I use Adagrad with either a single GPU or multiple GPUs. A self-contained sketch along the same lines is included below in case the notebook becomes unavailable.
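The sketch below mirrors the BoringModel setup; the class names and values are illustrative rather than the exact notebook code, and it assumes the PyTorch Lightning 1.2.x Trainer arguments (gpus, accelerator="ddp").

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        # Adagrad is the key ingredient: it allocates its state at construction time,
        # before Lightning moves the model to the GPU.
        return torch.optim.Adagrad(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=1)
    trainer.fit(model, DataLoader(RandomDataset()))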
Stack trace
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
| Name | Type | Params
---------------------------------
0 | layer | Linear | 66
---------------------------------
66 Trainable params
0 Non-trainable params
66 Total params
0.000 Total estimated model params size (MB)
/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 20 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 20 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Epoch 0: 0%| | 0/314 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/rajmund/test.py", line 118, in <module>
    test_x(tmpdir)
  File "/home/rajmund/test.py", line 110, in test_x
    trainer.fit(model, train, val)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 492, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 654, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 433, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1390, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 277, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 282, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/optim/adagrad.py", line 90, in step
    group['eps'])
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/optim/functional.py", line 48, in adagrad
    state_sum.addcmul_(grad, grad, value=1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
The second DDP process prints the same traceback, ending in:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!
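A possible workaround, sketched below on top of the repro model above, is to move the optimizer's state tensors onto the parameters' device before the first step. This uses the standard on_train_start hook and trainer.optimizers; it is an assumption on my part rather than an official fix.

import torch
import pytorch_lightning as pl


class PatchedBoringModel(pl.LightningModule):
    """Same model as the sketch above, plus a hook that relocates optimizer state."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.Adagrad(self.layer.parameters(), lr=0.1)

    def on_train_start(self):
        # Adagrad fills optimizer.state at construction time, while the model is
        # still on the CPU; move every tensor in the state onto the device of the
        # parameter it belongs to before the first optimizer.step().
        for optimizer in self.trainer.optimizers:
            for group in optimizer.param_groups:
                for p in group["params"]:
                    for key, value in optimizer.state[p].items():
                        if torch.is_tensor(value):
                            optimizer.state[p][key] = value.to(p.device)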
Environment
- PyTorch Version: 1.7.1
- PyTorch Lightning: 1.2.6
- OS: Linux
- How you installed PyTorch: pip
- Python version: 3.6
- CUDA/cuDNN version: 10.1
- GPU models and configuration: Titan Xp
- Any other relevant information: -