🐛 Bug
The problem is that trainer.fit() with accelerator='ddp' spends an extremely long time doing something before it gets the CPUs and GPUs working, and I cannot interrupt the kernel; I have to restart it.
Please reproduce using the BoringModel
To Reproduce
I tried the Boring Model, and I can reproduce the issue.
The only modification I made is in the "Define the test" section. The code is below:
def test_x(tmpdir):
    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,            # added to use 4 gpus
        accelerator='ddp'  # added to use ddp
    )

    # Train the model ⚡
    trainer.fit(model, train, val)

    trainer.test(test_dataloaders=test)

My code that initially encountered this issue is in the discussion post.
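For completeness, here is roughly how I would run the same repro as a standalone script instead of from a notebook cell. My understanding (an assumption on my part, not something I have confirmed) is that accelerator='ddp' re-launches the training script in separate processes, which a Jupyter kernel cannot do, so the launch method may matter. The RandomDataset and BoringModel below are simplified stand-ins for the notebook's definitions, not exact copies:

# repro_ddp.py -- hypothetical standalone-script version of the notebook repro
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Random data, standing in for the notebook's dataset."""

    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    """Single linear layer with an arbitrary loss, standing in for the notebook's model."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        output = self(batch)
        loss = torch.nn.functional.mse_loss(output, torch.ones_like(output))
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        output = self(batch)
        return {"x": torch.nn.functional.mse_loss(output, torch.ones_like(output))}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    train = DataLoader(RandomDataset(32, 64), batch_size=2)
    val = DataLoader(RandomDataset(32, 64), batch_size=2)

    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,
        accelerator="ddp",  # ddp re-launches this script in worker processes
    )
    trainer.fit(BoringModel(), train, val)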
Expected behavior
The expected behavior is that training starts within a couple of minutes; instead, trainer.fit() hangs while the GPUs and CPUs stay idle.
Environment
My environment, as reported by the official collection script, is below. I run my code on a shared GPU cluster after requesting compute resources; I usually request 512 GB of memory, 32 cores, and 4 V100s. The environment is managed by my personal conda installation and does not interfere with anyone else's. If you need more details about the configuration, just let me know.
(torch) [liangf@gpu208-14 liangf]$ python collect_env_details.py
* CUDA:
- GPU:
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- available: True
- version: 11.0
* Packages:
- numpy: 1.20.0
- pyTorch_debug: False
- pyTorch_version: 1.7.1
- pytorch-lightning: 1.1.8
- tqdm: 4.56.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.9.1
- version: #1 SMP Tue Nov 17 13:59:11 UTC 2020
Additional context
If I change the code above in the BoringModel to the version below, the trainer "works" as expected: with accelerator='dp' it takes less than a minute to get everything set up and keeps the CPUs and GPUs busy, whereas with accelerator='ddp' it takes 10 minutes or more and still has not started running by the time I lose patience.
def test_x(tmpdir):
    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,           # added to use 4 gpus
        accelerator='dp'  # changed to use dp instead of ddp
    )

    # Train the model ⚡
    trainer.fit(model, train, val)

    trainer.test(test_dataloaders=test)

By "works" I mean it gets the GPUs running, but a runtime error is thrown later. I think that is a separate issue, possibly that the code in the BoringModel notebook is not runnable as-is in a multi-GPU environment. I don't know the cause, since I am just moving from plain PyTorch to PyTorch Lightning and the code in the notebook looks reasonably good to me. (The full traceback is below; my guess at a possible cause, with a small sketch, follows after it.)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-10-1f9f6fbe4f6c> in <module>
----> 1 test_x(tmpdir)
<ipython-input-9-8b8914eff5a4> in test_x(tmpdir)
12
13 # Train the model ⚡
---> 14 trainer.fit(model, train, val)
15
16 trainer.test(test_dataloaders=test)
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
508 self.call_hook('on_fit_start')
509
--> 510 results = self.accelerator_backend.train()
511 self.accelerator_backend.teardown()
512
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in train(self)
55 def train(self):
56 self.trainer.setup_trainer(self.trainer.model)
---> 57 return self.train_or_test()
58
59 def teardown(self):
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in train_or_test(self)
72 else:
73 self.trainer.train_loop.setup_training()
---> 74 results = self.trainer.train()
75 return results
76
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in train(self)
559 with self.profiler.profile("run_training_epoch"):
560 # run train epoch
--> 561 self.train_loop.run_training_epoch()
562
563 if self.max_steps and self.max_steps <= self.global_step:
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
548 # ------------------------------------
549 with self.trainer.profiler.profile("run_training_batch"):
--> 550 batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
551
552 # when returning -1 from train_step, we end epoch early
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx, dataloader_idx)
716
717 # optimizer step
--> 718 self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
719
720 else:
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
483
484 # model hook
--> 485 model_ref.optimizer_step(
486 self.trainer.current_epoch,
487 batch_idx,
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py in optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using_native_amp, using_lbfgs)
1296
1297 """
-> 1298 optimizer.step(closure=optimizer_closure)
1299
1300 def optimizer_zero_grad(
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py in step(self, closure, make_optimizer_step, *args, **kwargs)
284
285 if make_optimizer_step:
--> 286 self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
287 else:
288 # make sure to call optimizer_closure when accumulating
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py in __optimizer_step(self, closure, profiler_name, *args, **kwargs)
142 else:
143 with trainer.profiler.profile(profiler_name):
--> 144 optimizer.step(closure=closure, *args, **kwargs)
145
146 accelerator_backend = trainer.accelerator_backend
~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/optim/lr_scheduler.py in wrapper(*args, **kwargs)
65 instance._step_count += 1
66 wrapped = func.__get__(instance, cls)
---> 67 return wrapped(*args, **kwargs)
68
69 # Note that the returned function here is no longer a bound method,
~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
24 def decorate_context(*args, **kwargs):
25 with self.__class__():
---> 26 return func(*args, **kwargs)
27 return cast(F, decorate_context)
28
~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/optim/sgd.py in step(self, closure)
84 if closure is not None:
85 with torch.enable_grad():
---> 86 loss = closure()
87
88 for group in self.param_groups:
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in train_step_and_backward_closure()
706
707 def train_step_and_backward_closure():
--> 708 result = self.training_step_and_backward(
709 split_batch,
710 batch_idx,
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, hiddens)
814 # backward pass
815 with self.trainer.profiler.profile("model_backward"):
--> 816 self.backward(result, optimizer, opt_idx)
817
818 # hook - call this hook only
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in backward(self, result, optimizer, opt_idx, *args, **kwargs)
840 self.trainer.accelerator_backend.backward(result, optimizer, opt_idx, *args, **kwargs)
841 else:
--> 842 result.closure_loss = self.trainer.accelerator_backend.backward(
843 result.closure_loss, optimizer, opt_idx, *args, **kwargs
844 )
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in backward(self, closure_loss, optimizer, opt_idx, *args, **kwargs)
107 # do backward pass
108 model = self.trainer.get_model()
--> 109 model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
110
111 # once backward has been applied, release graph
~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py in backward(self, loss, optimizer, optimizer_idx, *args, **kwargs)
1160 """
1161 if self.trainer.train_loop.automatic_optimization or self._running_manual_backward:
-> 1162 loss.backward(*args, **kwargs)
1163
1164 def toggle_optimizer(self, optimizer: Optimizer, optimizer_idx: int):
~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
219 retain_graph=retain_graph,
220 create_graph=create_graph)
--> 221 torch.autograd.backward(self, gradient, retain_graph, create_graph)
222
223 def register_hook(self, hook):
~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
124
125 grad_tensors_ = _tensor_or_tensors_to_tuple(grad_tensors, len(tensors))
--> 126 grad_tensors_ = _make_grads(tensors, grad_tensors_)
127 if retain_graph is None:
128 retain_graph = create_graph
~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/__init__.py in _make_grads(outputs, grads)
48 if out.requires_grad:
49 if out.numel() != 1:
---> 50 raise RuntimeError("grad can be implicitly created only for scalar outputs")
51 new_grads.append(torch.ones_like(out, memory_format=torch.preserve_format))
52 else:
RuntimeError: grad can be implicitly created only for scalar outputs
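For what it's worth, my current guess about the error above (unconfirmed, since I am new to Lightning) is that with accelerator='dp' the per-GPU outputs of training_step are gathered, so the loss reaches backward() as a tensor with one element per GPU instead of a scalar. The sketch below shows how I would reduce it back to a scalar in a training_step_end hook if that guess is right; BoringModelForDP is just a name I made up for the sketch:

import torch
import pytorch_lightning as pl


class BoringModelForDP(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        output = self(batch)
        # scalar loss on each GPU replica ...
        loss = torch.nn.functional.mse_loss(output, torch.ones_like(output))
        return {"loss": loss}

    def training_step_end(self, outputs):
        # ... but dp gathers the replicas' outputs into a tensor of shape
        # [num_gpus], so average it back down to a scalar before backward()
        return {"loss": outputs["loss"].mean()}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)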