
Early stopping fails on horovod with cannot unpack non-iterable NoneType object #3381

@undertherain

## 🐛 Bug

When I do early stopping with horovod distributed training, it fails with `cannot unpack non-iterable NoneType object` in tqdm.
It fails only on some training datasets. I also see from the logs that early stopping was triggered only three times, while I'm training on 4 workers.
This makes me suspect that one of the workers did not trigger early stopping, presumably because each worker decides based on its local validation loss rather than the averaged one.

```python
        result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
        result.log("val_loss", loss, sync_dist=True)
```

As you can see, I'm asking pytorch-lightning to average the validation loss, but as in my previous issue #3338, the problem seems to be related to early stopping using a different dict.
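
For context, that snippet lives in `validation_step`; a minimal sketch of the surrounding step looks like this (the model, loss function, and batch layout are illustrative placeholders, not my actual code):

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)  # placeholder model

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch                       # illustrative batch layout
        loss = F.mse_loss(self(x), y)      # illustrative loss
        # early_stop_on / checkpoint_on feed the EarlyStopping and
        # ModelCheckpoint callbacks (0.9.x Result API); sync_dist is
        # supposed to average the logged value across workers
        result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
        result.log("val_loss", loss, sync_dist=True)
        return result
```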
Here's the full error message:

```
Epoch 7:   0% 0/8 [00:00<?, ?it/s, loss=0.480, v_num=50]Traceback (most recent call last):
  File "main.py", line 72, in <module>       
    main()
  File "main.py", line 68, in main
    trainer.fit(model, data_module)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1016, in fit
    results = self.accelerator_backend.train()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/horovod_backend.py", line 108, in train
    result = self.trainer.run_pretrain_routine(self.trainer.model)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in run_pretrain_routine
    self.train()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 396, in train
    self.run_training_epoch()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 484, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 832, in run_training_batch
    opt_closure_result = self.optimizer_closure(
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1065, in optimizer_closure
    model_ref.backward(self, closure_loss, optimizer, opt_idx)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 312, in backward
    loss.backward()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
Exception ignored in: <function tqdm.__del__ at 0x2b61156e6820>
Traceback (most recent call last):
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1086, in __del__
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1293, in close
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1471, in display
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1089, in __repr__
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: cannot unpack non-iterable NoneType object
```

Without early stopping it works ok.
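
If the per-rank hypothesis above is right, a possible workaround (just a guess on my side, not a verified fix) would be to average the validation loss across ranks manually with hvd.allreduce before handing it to EvalResult, so that every worker sees the same value and reaches the same early-stopping decision:

```python
import horovod.torch as hvd
import torch.nn.functional as F
import pytorch_lightning as pl


# drop-in replacement for the validation_step sketched above
def validation_step(self, batch, batch_idx):
    x, y = batch
    loss = F.mse_loss(self(x), y)
    # hvd.allreduce averages across ranks by default, so each worker
    # feeds the same loss to early_stop_on and the decisions stay in sync
    avg_loss = hvd.allreduce(loss.detach(), name="val_loss")
    result = pl.EvalResult(early_stop_on=avg_loss, checkpoint_on=avg_loss)
    result.log("val_loss", avg_loss)
    return result
```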

### Environment
- CUDA:
  - GPU:
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.19.1
  - pyTorch_debug: False
  - pyTorch_version: 1.6.0
  - pytorch-lightning: 0.9.1rc1
  - tensorboard: 2.2.0
  - tqdm: 4.46.1
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.2
  - version: #1 SMP Fri Apr 20 16:44:24 UTC 2018
I also had this on lightning 0.9.0; I actually upgraded to the rc hoping it would magically fix the problem.

Labels

3rd party (Related to a 3rd-party), bug (Something isn't working)
