Labels: bug, strategy: fairscale sharded (removed)
Description
🐛 Bug
I'm trying to fine-tune a Transformer model (XLM-R) on multiple GPUs, using the ddp_sharded strategy. Training works, but at the end of the first epoch I get this error:
RuntimeError: Trying to create tensor with negative dimension -2061635393: [-2061635393]
I'm running the latest PyTorch Lightning, PyTorch 1.10, and two V100s on a Power9-based architecture. I've tried both 16-bit and 32-bit precision. The optimizer I'm using is RAdam, from PyTorch.
I can provide the code if needed.
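For reference, here is a minimal sketch of the setup described above. The module is a hypothetical stand-in (the real code fine-tunes XLM-R for NER and is not shown); only the strategy, device count, precision, and optimizer match the report.

```python
import torch
import pytorch_lightning as pl


class DummyModule(pl.LightningModule):
    """Hypothetical stand-in for the actual XLM-R fine-tuning module."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        # RAdam from torch.optim, as in the report
        return torch.optim.RAdam(self.parameters(), lr=1e-5)


trainer = pl.Trainer(
    strategy="ddp_sharded",  # fairscale sharded data parallel
    accelerator="gpu",
    devices=2,               # two V100s
    precision=16,            # the error also occurs with precision=32
    max_epochs=1,
)
# trainer.fit(DummyModule(), datamodule=...)  # fails at the end of epoch 0
```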
Here is the complete stack trace:
Traceback (most recent call last):
  File "transformers_ner/train.py", line 186, in main
    train(conf)
  File "transformers_ner/train.py", line 103, in train
    trainer.fit(pl_module, datamodule=pl_data_module)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 736, in _call_and_handle_interrupt
    self.strategy.reconciliate_processes(traceback.format_exc())
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 451, in reconciliate_processes
    raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 0
Traceback (most recent call last):
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 297, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1636, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 308, in on_train_epoch_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 379, in _save_topk_checkpoint
    self._save_monitor_checkpoint(trainer, monitor_candidates)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 651, in _save_monitor_checkpoint
    self._update_best_and_save(current, trainer, monitor_candidates)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 702, in _update_best_and_save
    self._save_checkpoint(trainer, filepath)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 384, in _save_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2467, in save_checkpoint
    self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 444, in save_checkpoint
    _checkpoint = self.dump_checkpoint(weights_only)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 380, in dump_checkpoint
    optimizer_state = self.trainer.strategy.optimizer_state(optimizer)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/strategies/sharded.py", line 117, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/fairscale/optim/oss.py", line 364, in consolidate_state_dict
    dist.broadcast_object_list(
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1823, in broadcast_object_list
    object_tensor = torch.empty(
RuntimeError: Trying to create tensor with negative dimension -2061635393: [-2061635393]
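The negative size in the error is consistent with a signed 32-bit integer wraparound: read as an unsigned value, -2061635393 corresponds to about 2.08 GiB, which would mean the pickled optimizer state being broadcast during `consolidate_state_dict()` exceeds what a signed 32-bit length field can represent. This is an inference from the number, not a confirmed diagnosis; the arithmetic itself can be checked with a quick sketch:

```python
import ctypes

# The traceback reports size -2061635393. If that came from a signed 32-bit
# wraparound, the intended (unsigned) size is the reported value plus 2**32.
reported = -2061635393
intended = reported + 2**32
print(intended)            # 2233331903 bytes
print(intended / 2**30)    # ~2.08 GiB

# Round trip: truncating the intended size to int32 reproduces the
# negative dimension from the traceback.
assert ctypes.c_int32(intended).value == reported
```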
Environment
- PyTorch Lightning Version (e.g., 1.5.0): 1.6.4
- PyTorch Version (e.g., 1.10): 1.10
- Python version (e.g., 3.9): 3.8
- OS (e.g., Linux): Linux
- GPU models and configuration: V100
- How you installed PyTorch (conda, pip, source): conda
- Any other relevant information: Power9 architecture