Trying to create tensor with negative dimension with ddp_sharded #13431

@Riccorl

🐛 Bug

I'm trying to fine-tune a Transformer model (XLM-R) on multiple GPUs using the ddp_sharded strategy. Training runs, but at the end of the first epoch I get this error:

RuntimeError: Trying to create tensor with negative dimension -2061635393: [-2061635393]

I'm running PyTorch Lightning 1.6.4 and PyTorch 1.10 on two V100 GPUs on a Power9-based architecture. The error occurs with both 16-bit and 32-bit precision. The optimizer is RAdam, from PyTorch.

I can provide the code if needed.

Here is the complete stack trace:
Traceback (most recent call last):
  File "transformers_ner/train.py", line 186, in main
    train(conf)
  File "transformers_ner/train.py", line 103, in train
    trainer.fit(pl_module, datamodule=pl_data_module)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 736, in _call_and_handle_interrupt
    self.strategy.reconciliate_processes(traceback.format_exc())
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 451, in reconciliate_processes
    raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 0
 Traceback (most recent call last):
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 297, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1636, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 308, in on_train_epoch_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 379, in _save_topk_checkpoint
    self._save_monitor_checkpoint(trainer, monitor_candidates)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 651, in _save_monitor_checkpoint
    self._update_best_and_save(current, trainer, monitor_candidates)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 702, in _update_best_and_save
    self._save_checkpoint(trainer, filepath)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 384, in _save_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2467, in save_checkpoint
    self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 444, in save_checkpoint
    _checkpoint = self.dump_checkpoint(weights_only)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 380, in dump_checkpoint
    optimizer_state = self.trainer.strategy.optimizer_state(optimizer)
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/pytorch_lightning/strategies/sharded.py", line 117, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/fairscale/optim/oss.py", line 364, in consolidate_state_dict
    dist.broadcast_object_list(
  File "/m100/home/usertrain/a08trc0m/.conda/envs/ner/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1823, in broadcast_object_list
    object_tensor = torch.empty(
RuntimeError: Trying to create tensor with negative dimension -2061635393: [-2061635393]
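
One possible reading of the negative number, offered as an assumption rather than a confirmed diagnosis: if the byte count passed to torch.empty was narrowed to a signed 32-bit integer somewhere along the way, a size just above 2 GiB would wrap around to a negative value. Nothing in the trace confirms this. The sketch below uses only plain Python arithmetic; the helper names wrap_to_int32 and unwrap_once are hypothetical and are not part of PyTorch, fairscale, or PyTorch Lightning.

```python
INT32_MODULUS = 2 ** 32
INT32_MAX = 2 ** 31 - 1


def wrap_to_int32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a signed two's-complement integer."""
    n &= 0xFFFFFFFF
    return n - INT32_MODULUS if n > INT32_MAX else n


def unwrap_once(wrapped: int) -> int:
    """Return the smallest non-negative value that wraps to `wrapped`."""
    return wrapped % INT32_MODULUS


# Value reported in the RuntimeError above.
reported = -2061635393

# Assumption: the true size overflowed int32 exactly once.
candidate = unwrap_once(reported)
print(candidate)                # 2233331903
print(candidate / 2 ** 30)      # size in GiB, roughly 2.08

# Round trip: the candidate size wraps back to the reported value.
print(wrap_to_int32(candidate) == reported)  # True
```

Under that assumption, the object being broadcast during consolidate_state_dict would be about 2.08 GiB, which is above the int32 limit of 2147483647 bytes.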

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): 1.6.4
  • PyTorch Version (e.g., 1.10): 1.10
  • Python version (e.g., 3.9): 3.8
  • OS (e.g., Linux): Linux
  • GPU models and configuration: V100
  • How you installed PyTorch (conda, pip, source): conda
  • Any other relevant information: Power9 Architecture

cc @SeanNaren @awaelchli @rohitgr7 @akihironitta
