
trainer.fit() stuck with accelerator set to "ddp" #5961

@ifsheldon


🐛 Bug

The problem is that trainer.fit() with accelerator set to 'ddp' spends an extremely long time doing something before it ever gets the CPUs and GPUs working. I also cannot interrupt the kernel; I have to restart it.

Please reproduce using the BoringModel

To Reproduce

I tried the BoringModel notebook and can reproduce the issue.

The only modification I made is in the "Define the test" section. The code is below:

def test_x(tmpdir):
    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,            # added to use 4 GPUs
        accelerator='ddp'  # added to use ddp
    )

    # Train the model ⚡
    trainer.fit(model, train, val)

    trainer.test(test_dataloaders=test)

My code that originally ran into this issue is in the discussion post.
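
For completeness, the train, val, and test loaders referenced in test_x come from the BoringModel notebook; a minimal stand-in, assuming the notebook's usual RandomDataset setup, would look like this:

import torch
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):
    # random tensors, mirroring the dataset used in the BoringModel notebook
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

# stand-in loaders matching the names used in test_x above
train = DataLoader(RandomDataset(32, 64), batch_size=32)
val = DataLoader(RandomDataset(32, 64), batch_size=32)
test = DataLoader(RandomDataset(32, 64), batch_size=32)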

Expected behavior

The expected behavior is that training starts within a couple of minutes, but instead trainer.fit() gets stuck while the GPUs and CPUs stay idle.

Environment

My environment, as detected by the official Python script, is below. I run my code on a shared GPU cluster after applying for compute resources; I usually request 512 GB of memory, 32 cores, and 4 V100s. The environment is managed by my personal conda installation, so it does not interfere with other users' environments. If you want to know more about the configuration, just let me know.

(torch) [liangf@gpu208-14 liangf]$ python collect_env_details.py
* CUDA:
        - GPU:
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
        - available:         True
        - version:           11.0
* Packages:
        - numpy:             1.20.0
        - pyTorch_debug:     False
        - pyTorch_version:   1.7.1
        - pytorch-lightning: 1.1.8
        - tqdm:              4.56.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.9.1
        - version:           #1 SMP Tue Nov 17 13:59:11 UTC 2020

Additional context

If I change the code above in the BoringModel to the version below, the trainer "works" as expected. With accelerator='dp' it takes less than a minute to get everything set up and keeps the CPUs and GPUs busy, while with accelerator='ddp' it takes 10 minutes or more and still has not gotten anything running by the time I lose patience.

def test_x(tmpdir):
    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,           # added to use 4 GPUs
        accelerator='dp'  # changed to use dp instead of ddp
    )

    # Train the model ⚡
    trainer.fit(model, train, val)

    trainer.test(test_dataloaders=test)

By "works" I meant it can get GPUs running, but later a runtime error is thrown. And I think this will be another issue, which maybe that the code in the boring model notebook is not runnable in multi-GPU environment. However, I don't know what is the cause, since I am just transferring from ordinary pytorch to pytorch-lightning, and the code in the notebook looks reasonably good for me.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-1f9f6fbe4f6c> in <module>
----> 1 test_x(tmpdir)

<ipython-input-9-8b8914eff5a4> in test_x(tmpdir)
     12 
     13     # Train the model ⚡
---> 14     trainer.fit(model, train, val)
     15 
     16     trainer.test(test_dataloaders=test)

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    508         self.call_hook('on_fit_start')
    509 
--> 510         results = self.accelerator_backend.train()
    511         self.accelerator_backend.teardown()
    512 

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in train(self)
     55     def train(self):
     56         self.trainer.setup_trainer(self.trainer.model)
---> 57         return self.train_or_test()
     58 
     59     def teardown(self):

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in train_or_test(self)
     72         else:
     73             self.trainer.train_loop.setup_training()
---> 74             results = self.trainer.train()
     75         return results
     76 

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in train(self)
    559                 with self.profiler.profile("run_training_epoch"):
    560                     # run train epoch
--> 561                     self.train_loop.run_training_epoch()
    562 
    563                 if self.max_steps and self.max_steps <= self.global_step:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    548             # ------------------------------------
    549             with self.trainer.profiler.profile("run_training_batch"):
--> 550                 batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
    551 
    552             # when returning -1 from train_step, we end epoch early

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx, dataloader_idx)
    716 
    717                         # optimizer step
--> 718                         self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
    719 
    720                     else:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
    483 
    484         # model hook
--> 485         model_ref.optimizer_step(
    486             self.trainer.current_epoch,
    487             batch_idx,

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py in optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using_native_amp, using_lbfgs)
   1296 
   1297         """
-> 1298         optimizer.step(closure=optimizer_closure)
   1299 
   1300     def optimizer_zero_grad(

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py in step(self, closure, make_optimizer_step, *args, **kwargs)
    284 
    285         if make_optimizer_step:
--> 286             self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
    287         else:
    288             # make sure to call optimizer_closure when accumulating

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py in __optimizer_step(self, closure, profiler_name, *args, **kwargs)
    142         else:
    143             with trainer.profiler.profile(profiler_name):
--> 144                 optimizer.step(closure=closure, *args, **kwargs)
    145 
    146         accelerator_backend = trainer.accelerator_backend

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/optim/lr_scheduler.py in wrapper(*args, **kwargs)
     65                 instance._step_count += 1
     66                 wrapped = func.__get__(instance, cls)
---> 67                 return wrapped(*args, **kwargs)
     68 
     69             # Note that the returned function here is no longer a bound method,

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     24         def decorate_context(*args, **kwargs):
     25             with self.__class__():
---> 26                 return func(*args, **kwargs)
     27         return cast(F, decorate_context)
     28 

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/optim/sgd.py in step(self, closure)
     84         if closure is not None:
     85             with torch.enable_grad():
---> 86                 loss = closure()
     87 
     88         for group in self.param_groups:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in train_step_and_backward_closure()
    706 
    707                         def train_step_and_backward_closure():
--> 708                             result = self.training_step_and_backward(
    709                                 split_batch,
    710                                 batch_idx,

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, hiddens)
    814                 # backward pass
    815                 with self.trainer.profiler.profile("model_backward"):
--> 816                     self.backward(result, optimizer, opt_idx)
    817 
    818                 # hook - call this hook only

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in backward(self, result, optimizer, opt_idx, *args, **kwargs)
    840             self.trainer.accelerator_backend.backward(result, optimizer, opt_idx, *args, **kwargs)
    841         else:
--> 842             result.closure_loss = self.trainer.accelerator_backend.backward(
    843                 result.closure_loss, optimizer, opt_idx, *args, **kwargs
    844             )

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in backward(self, closure_loss, optimizer, opt_idx, *args, **kwargs)
    107             # do backward pass
    108             model = self.trainer.get_model()
--> 109             model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
    110 
    111             # once backward has been applied, release graph

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py in backward(self, loss, optimizer, optimizer_idx, *args, **kwargs)
   1160         """
   1161         if self.trainer.train_loop.automatic_optimization or self._running_manual_backward:
-> 1162             loss.backward(*args, **kwargs)
   1163 
   1164     def toggle_optimizer(self, optimizer: Optimizer, optimizer_idx: int):

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219                 retain_graph=retain_graph,
    220                 create_graph=create_graph)
--> 221         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222 
    223     def register_hook(self, hook):

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    124 
    125     grad_tensors_ = _tensor_or_tensors_to_tuple(grad_tensors, len(tensors))
--> 126     grad_tensors_ = _make_grads(tensors, grad_tensors_)
    127     if retain_graph is None:
    128         retain_graph = create_graph

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/__init__.py in _make_grads(outputs, grads)
     48             if out.requires_grad:
     49                 if out.numel() != 1:
---> 50                     raise RuntimeError("grad can be implicitly created only for scalar outputs")
     51                 new_grads.append(torch.ones_like(out, memory_format=torch.preserve_format))
     52             else:

RuntimeError: grad can be implicitly created only for scalar outputs
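
My guess, which I have not verified, is that under dp the per-GPU outputs of training_step are gathered, so the loss that reaches backward() is a vector with one entry per GPU instead of a scalar. A minimal sketch of a training_step_end hook that reduces the gathered loss back to a scalar, assuming the notebook's training_step returns a dict with a "loss" key (the class name BoringModelDP is just a placeholder of mine):

class BoringModelDP(BoringModel):  # BoringModel as defined in the notebook
    def training_step_end(self, training_step_outputs):
        # Under dp, "loss" may arrive here as a tensor with one value per GPU;
        # reduce it to a scalar so that loss.backward() sees a scalar output.
        training_step_outputs["loss"] = training_step_outputs["loss"].mean()
        return training_step_outputs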

Labels: bug (Something isn't working), help wanted (Open to be worked on)
