Adagrad not working with GPU and DDP #6824

@qianlivia

Description

🐛 Bug

Adagrad doesn't work with GPUs and DDP because the optimizer is created before the model is moved to CUDA, so the optimizer state ends up on the wrong device. I believe this issue was already fixed once in an earlier version: #554
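For context, the construction order matters specifically for Adagrad because, unlike optimizers such as Adam that create their state lazily on the first `step()`, Adagrad allocates its per-parameter state (`state['sum']`) eagerly in its constructor, pinned to whatever device the parameters are on at that moment. A minimal plain-PyTorch sketch of the failure mode, without Lightning (assumes a CUDA device is available):

```python
import torch

model = torch.nn.Linear(32, 2)                 # parameters start on CPU
opt = torch.optim.Adagrad(model.parameters())  # state['sum'] is allocated eagerly, on CPU

model.cuda()  # parameters are moved in place, but opt.state stays on CPU

loss = model(torch.randn(4, 32, device="cuda")).sum()
loss.backward()  # gradients live on cuda:0
opt.step()       # state_sum.addcmul_(grad, grad) -> RuntimeError: cuda:0 vs cpu
```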

How to reproduce using the BoringModel

https://colab.research.google.com/drive/1HfyL5htoOkPETggTLwYNfh94HrNc6TOS?usp=sharing

The error occurred when I used Adagrad with both a single GPU and multiple GPUs.
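Until this is fixed, one possible workaround is to move the optimizer state onto the model's device once the model has been placed there. A minimal, untested sketch (`move_optimizer_state` is a hypothetical helper, not part of Lightning's API; `on_train_start`, `self.trainer.optimizers`, and `self.device` are standard Lightning hooks/attributes):

```python
import torch
import pytorch_lightning as pl

def move_optimizer_state(optimizer, device):
    # Hypothetical helper: move every tensor held in the optimizer state to `device`.
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)

class WorkaroundModel(pl.LightningModule):
    def on_train_start(self):
        # By the time this hook runs, the model is already on its target device,
        # so Adagrad's eagerly allocated state can be moved to match.
        for optimizer in self.trainer.optimizers:
            move_optimizer_state(optimizer, self.device)
```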

Stack trace

LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2

  | Name  | Type   | Params
---------------------------------
0 | layer | Linear | 66    
---------------------------------
66        Trainable params
0         Non-trainable params
66        Total params
0.000     Total estimated model params size (MB)
/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 20 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 20 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Epoch 0:   0%|                                          | 0/314 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/rajmund/test.py", line 118, in <module>
    test_x(tmpdir)
  File "/home/rajmund/test.py", line 110, in test_x
    trainer.fit(model, train, val)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 492, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 654, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 433, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1390, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 277, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 282, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/optim/adagrad.py", line 90, in step
    group['eps'])
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/optim/functional.py", line 48, in adagrad
    state_sum.addcmul_(grad, grad, value=1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

The second DDP process fails with the identical traceback, ending in:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!

Environment

  • PyTorch Version: 1.7.1
  • PyTorch Lightning: 1.2.6
  • OS: Linux
  • How you installed PyTorch: pip
  • Python version: 3.6
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: Titan Xp
  • Any other relevant information: -

Labels

  • bug (Something isn't working)
  • distributed (Generic distributed-related topic)
  • help wanted (Open to be worked on)
  • priority: 0 (High priority task)
