
Checkpoint issue when using Horovod distributed backend #6947

@liob

Description

🐛 Bug

I run into an issue when I try to keep the top-k models (save_top_k) with a checkpoint callback while Horovod is enabled as the distributed backend. It appears that the reduce-sum operation is not handled correctly.

Although the minimal example runs on a single node, the same error appears in a multi-node setup. It makes no difference whether GPU support is enabled. If Horovod is disabled, the code behaves as expected.

To Reproduce

I have created a minimal example to reproduce the issue: Colab
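The setup is roughly the following (a sketch of the essential pieces, not the exact notebook contents; the model, data, and hyperparameters are placeholders): a LightningModule that logs val_loss, a ModelCheckpoint with save_top_k, and a Trainer with accelerator="horovod".

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader, TensorDataset


class LinearModel(pl.LightningModule):
    """Placeholder model; any module that logs a monitored metric will do."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # The checkpoint callback monitors this metric for the top-k decision.
        self.log("val_loss", torch.nn.functional.mse_loss(self(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    train_data = DataLoader(dataset, batch_size=8)
    val_data = DataLoader(dataset, batch_size=8)

    checkpoint = ModelCheckpoint(monitor="val_loss", save_top_k=2)
    trainer = pl.Trainer(
        max_epochs=2,
        accelerator="horovod",  # removing this makes the run succeed
        callbacks=[checkpoint],
    )
    trainer.fit(LinearModel(), train_data, val_data)
```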

Traceback

Traceback (most recent call last):
  File "./bugreport.py", line 50, in <module>
    trainer.fit(model, train_data, val_data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/horovod.py", line 99, in start_training
    self._results = trainer.run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 669, in run_train
    self.train_loop.on_train_end()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 134, in on_train_end
    self.check_checkpoint_callback(should_update=True, is_last=True)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 164, in check_checkpoint_callback
    cb.on_validation_end(self.trainer, model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 259, in save_checkpoint
    self._save_top_k_checkpoints(trainer, pl_module, monitor_candidates)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 562, in _save_top_k_checkpoints
    if self.check_monitor_top_k(trainer, current):
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 362, in check_monitor_top_k
    should_update_best_and_save = trainer.training_type_plugin.reduce_boolean_decision(should_update_best_and_save)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/parallel.py", line 86, in reduce_boolean_decision
    decision = self.reduce(decision, reduce_op=ReduceOp.SUM)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/horovod.py", line 144, in reduce
    raise ValueError(f"unrecognized `reduce_op`: {reduce_op}")
ValueError: unrecognized `reduce_op`: ReduceOp.SUM
Exception ignored in: <function tqdm.__del__ at 0x7f65d0a23280>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1128, in __del__
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1341, in close
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1520, in display
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1131, in __repr__
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1481, in format_dict

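The last frames point at the cause: reduce_boolean_decision in parallel.py forwards the torch.distributed ReduceOp.SUM enum, while the Horovod plugin's reduce apparently matches only string op names and falls through to the ValueError. Below is a hedged sketch (not the library's actual code) of an op mapping that would accept both spellings; map_reduce_op is a hypothetical helper name.

```python
import horovod.torch as hvd
from torch.distributed import ReduceOp


def map_reduce_op(reduce_op):
    """Hypothetical helper: translate Lightning's reduce_op into a Horovod op."""
    if reduce_op in (None, "avg", "mean"):
        return hvd.Average
    if reduce_op == "sum" or reduce_op == ReduceOp.SUM:
        return hvd.Sum
    # Anything else is still rejected, matching the current error message.
    raise ValueError(f"unrecognized `reduce_op`: {reduce_op}")
```

With a mapping like this, the per-rank boolean decision (cast to a tensor) could be summed across workers via hvd.allreduce, and the save_top_k comparison would proceed instead of crashing.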
Environment

Local Setup

  • CUDA:
    - GPU:
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - Tesla V100-SXM2-32GB
    - available: True
    - version: 11.1
  • Packages:
    - numpy: 1.19.2
    - pyTorch_debug: False
    - pyTorch_version: 1.8.0
    - pytorch-lightning: 1.2.6
    - tqdm: 4.51.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - processor: x86_64
    - python: 3.8.8
    - version: #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019

Colab

  • CUDA:
    • GPU:
      • Tesla K80
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.8.1+cu101
    • pytorch-lightning: 1.2.7
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.10
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

Labels

bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task)
