Labels: bug, help wanted, priority: 1
Description
🐛 Bug
I run into an issue when I try to keep the top-k models (save_top_k) via a checkpoint callback while Horovod is enabled as the distributed backend. It appears that the reduce-sum operation is not handled correctly by the Horovod training type plugin.
Although the minimal example runs on a single node, the same error appears in a multi-node setup. It makes no difference whether GPU support is enabled. If Horovod is disabled, the code behaves as expected.
To Reproduce
I have created a minimal example to reproduce the issue: Colab. A rough sketch of the setup is shown below.
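The sketch below is only an approximation of the linked Colab (model, data, and script name are hypothetical stand-ins). The point is that any ModelCheckpoint with save_top_k and a monitored metric should trigger the crash at the end of training when the Trainer uses the Horovod backend:

```python
# Hypothetical minimal reproduction, assuming PyTorch Lightning 1.2.x with
# Horovod installed. Launch e.g. via: horovodrun -np 2 python bugreport.py
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # the monitored metric that ModelCheckpoint compares across ranks
        self.log("val_loss", torch.nn.functional.mse_loss(self(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def make_loader():
    x = torch.randn(64, 32)
    y = torch.randn(64, 1)
    return DataLoader(TensorDataset(x, y), batch_size=16)


if __name__ == "__main__":
    model = LitModel()
    # save_top_k makes ModelCheckpoint call check_monitor_top_k(), which in
    # turn calls reduce_boolean_decision() on the Horovod plugin -> crash.
    checkpoint_cb = ModelCheckpoint(monitor="val_loss", save_top_k=1)
    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="horovod",
        callbacks=[checkpoint_cb],
    )
    trainer.fit(model, make_loader(), make_loader())
```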
Traceback
Traceback (most recent call last):
File "./bugreport.py", line 50, in <module>
trainer.fit(model, train_data, val_data)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
self.dispatch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
self.accelerator.start_training(self)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/horovod.py", line 99, in start_training
self._results = trainer.run_train()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 669, in run_train
self.train_loop.on_train_end()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 134, in on_train_end
self.check_checkpoint_callback(should_update=True, is_last=True)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 164, in check_checkpoint_callback
cb.on_validation_end(self.trainer, model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
self.save_checkpoint(trainer, pl_module)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 259, in save_checkpoint
self._save_top_k_checkpoints(trainer, pl_module, monitor_candidates)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 562, in _save_top_k_checkpoints
if self.check_monitor_top_k(trainer, current):
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 362, in check_monitor_top_k
should_update_best_and_save = trainer.training_type_plugin.reduce_boolean_decision(should_update_best_and_save)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/parallel.py", line 86, in reduce_boolean_decision
decision = self.reduce(decision, reduce_op=ReduceOp.SUM)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/horovod.py", line 144, in reduce
raise ValueError(f"unrecognized `reduce_op`: {reduce_op}")
ValueError: unrecognized `reduce_op`: ReduceOp.SUM
Exception ignored in: <function tqdm.__del__ at 0x7f65d0a23280>
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1128, in __del__
File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1341, in close
File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1520, in display
File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1131, in __repr__
File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1481, in format_dict
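For context, the failing path appears to be that reduce_boolean_decision() in parallel.py passes the ReduceOp.SUM enum member, while the Horovod plugin's reduce() seems to compare reduce_op only against strings such as "sum" or "mean", so the enum falls through to the ValueError. The following is a simplified, illustrative sketch of that mismatch, not the actual pytorch_lightning source:

```python
# Illustrative sketch of the suspected mismatch (simplified; not the real
# library code). Assumes a torch build with distributed support so that
# torch.distributed.ReduceOp is importable.
from torch.distributed import ReduceOp


def reduce_boolean_decision(decision: bool) -> bool:
    # parallel.py line 86 in the traceback: the enum member is passed on.
    return horovod_reduce(decision, reduce_op=ReduceOp.SUM)


def horovod_reduce(tensor, reduce_op=None):
    # horovod.py line 144 in the traceback: string comparisons only,
    # so ReduceOp.SUM never matches and falls through to the ValueError.
    if reduce_op in (None, "avg", "mean"):
        pass  # would allreduce with hvd.Average
    elif reduce_op == "sum":
        pass  # would allreduce with hvd.Sum
    else:
        raise ValueError(f"unrecognized `reduce_op`: {reduce_op}")
    return tensor


if __name__ == "__main__":
    try:
        reduce_boolean_decision(True)
    except ValueError as err:
        print(err)  # unrecognized `reduce_op`: ReduceOp.SUM
```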
Environment
Local Setup
- CUDA:
- GPU:
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- Tesla V100-SXM2-32GB
- available: True
- version: 11.1
- Packages:
- numpy: 1.19.2
- pyTorch_debug: False
- pyTorch_version: 1.8.0
- pytorch-lightning: 1.2.6
- tqdm: 4.51.0
- System:
- OS: Linux
- architecture:
- 64bit
-
- processor: x86_64
- python: 3.8.8
- version: #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019
Colab
- CUDA:
- GPU:
- Tesla K80
- available: True
- version: 10.1
- Packages:
- numpy: 1.19.5
- pyTorch_debug: False
- pyTorch_version: 1.8.1+cu101
- pytorch-lightning: 1.2.7
- tqdm: 4.41.1
- System:
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.7.10
- version: #1 SMP Thu Jul 23 08:00:38 PDT 2020