
Checkpoint issue when using Horovod distributed backend #6947

@liob

Description

🐛 Bug

I run into an issue when I try to keep the top-k models (save_top_k) with a checkpoint callback while Horovod is enabled as the distributed backend. It appears that the reduce-sum operation is not handled correctly.

Although the minimal example runs on a single node, the same error appears in a multi-node setup. It makes no difference whether GPU support is enabled. If Horovod is disabled, the code behaves as expected.

To Reproduce

I have created a minimal example to reproduce the issue: Colab
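The setup is roughly the following (a sketch of the essential pieces, not the exact notebook contents; the model, data, and hyperparameters are placeholders): a LightningModule that logs val_loss, a ModelCheckpoint with save_top_k, and a Trainer with accelerator="horovod".

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader, TensorDataset


class LinearModel(pl.LightningModule):
    """Placeholder model; any module that logs a monitored metric will do."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # The checkpoint callback monitors this metric for the top-k decision.
        self.log("val_loss", torch.nn.functional.mse_loss(self(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    train_data = DataLoader(dataset, batch_size=8)
    val_data = DataLoader(dataset, batch_size=8)

    checkpoint = ModelCheckpoint(monitor="val_loss", save_top_k=2)
    trainer = pl.Trainer(
        max_epochs=2,
        accelerator="horovod",  # removing this makes the run succeed
        callbacks=[checkpoint],
    )
    trainer.fit(LinearModel(), train_data, val_data)
```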

Traceback

Traceback (most recent call last):
  File "./bugreport.py", line 50, in <module>
    trainer.fit(model, train_data, val_data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/horovod.py", line 99, in start_training
    self._results = trainer.run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 669, in run_train
    self.train_loop.on_train_end()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 134, in on_train_end
    self.check_checkpoint_callback(should_update=True, is_last=True)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 164, in check_checkpoint_callback
    cb.on_validation_end(self.trainer, model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 259, in save_checkpoint
    self._save_top_k_checkpoints(trainer, pl_module, monitor_candidates)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 562, in _save_top_k_checkpoints
    if self.check_monitor_top_k(trainer, current):
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 362, in check_monitor_top_k
    should_update_best_and_save = trainer.training_type_plugin.reduce_boolean_decision(should_update_best_and_save)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/parallel.py", line 86, in reduce_boolean_decision
    decision = self.reduce(decision, reduce_op=ReduceOp.SUM)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/horovod.py", line 144, in reduce
    raise ValueError(f"unrecognized `reduce_op`: {reduce_op}")
ValueError: unrecognized `reduce_op`: ReduceOp.SUM
Exception ignored in: <function tqdm.__del__ at 0x7f65d0a23280>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1128, in __del__
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1341, in close
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1520, in display
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1131, in __repr__
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1481, in format_dict

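The last frames point at the cause: reduce_boolean_decision in parallel.py forwards the torch.distributed ReduceOp.SUM enum, while the Horovod plugin's reduce apparently matches only string op names and falls through to the ValueError. Below is a hedged sketch (not the library's actual code) of an op mapping that would accept both spellings; map_reduce_op is a hypothetical helper name.

```python
import horovod.torch as hvd
from torch.distributed import ReduceOp


def map_reduce_op(reduce_op):
    """Hypothetical helper: translate Lightning's reduce_op into a Horovod op."""
    if reduce_op in (None, "avg", "mean"):
        return hvd.Average
    if reduce_op == "sum" or reduce_op == ReduceOp.SUM:
        return hvd.Sum
    # Anything else is still rejected, matching the current error message.
    raise ValueError(f"unrecognized `reduce_op`: {reduce_op}")
```

With a mapping like this, the per-rank boolean decision (cast to a tensor) could be summed across workers via hvd.allreduce, and the save_top_k comparison would proceed instead of crashing.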
Environment

Local Setup

  • CUDA:
    - GPU:
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - available: True
    - version: 11.1
  • Packages:
    - numpy: 1.19.2
    - pyTorch_debug: False
    - pyTorch_version: 1.8.0
    - pytorch-lightning: 1.2.6
    - tqdm: 4.51.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - processor: x86_64
    - python: 3.8.8
    - version: #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019

Colab

  • CUDA:
    • GPU:
      • Tesla K80
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.8.1+cu101
    • pytorch-lightning: 1.2.7
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.10
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

Labels

bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task)
