## 🐛 Bug
When I do early stopping with Horovod distributed training, it fails with cannot unpack non-iterable NoneType object in tqdm.
It fails only on some sets of training data. I also see from the logs that early stopping was initiated only three times, while I'm training on 4 workers.
This makes me suspect that one of the workers did not initiate early stopping - presumably because each worker decides based on its local validation loss rather than the averaged one.
result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
result.log("val_loss", loss, sync_dist=True)
As you can see, I'm asking pytorch-lightning to average the validation loss, but as was the case in my previous issue #3338, the problem seems to be that early stopping uses a different dict.
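As a workaround I'm considering averaging the metric across ranks myself before it reaches early stopping, roughly like this (just a sketch, not verified; the loss function here is a placeholder, my real validation_step differs):

```python
# Sketch only: average the validation loss over all Horovod ranks so every
# worker feeds the same value to early stopping.
import horovod.torch as hvd
import pytorch_lightning as pl
import torch.nn.functional as F


def validation_step(self, batch, batch_idx):
    x, y = batch
    loss = F.cross_entropy(self(x), y)  # placeholder loss
    # hvd.allreduce with the default op averages the tensor across ranks
    loss = hvd.allreduce(loss, name="val_loss")
    result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
    result.log("val_loss", loss)
    return result
```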
Here's the full error message:
Epoch 7: 0% 0/8 [00:00<?, ?it/s, loss=0.480, v_num=50]Traceback (most recent call last):
File "main.py", line 72, in <module>
main()
File "main.py", line 68, in main
trainer.fit(model, data_module)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1016, in fit
results = self.accelerator_backend.train()
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/horovod_backend.py", line 108, in train
result = self.trainer.run_pretrain_routine(self.trainer.model)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in run_pretrain_routine
self.train()
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 396, in train
self.run_training_epoch()
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 484, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 832, in run_training_batch
opt_closure_result = self.optimizer_closure(
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1065, in optimizer_closure
model_ref.backward(self, closure_loss, optimizer, opt_idx)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 312, in backward
loss.backward()
File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
Variable._execution_engine.run_backward(
RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
Exception ignored in: <function tqdm.__del__ at 0x2b61156e6820>
Traceback (most recent call last):
File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1086, in __del__
File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1293, in close
File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1471, in display
File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1089, in __repr__
File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: cannot unpack non-iterable NoneType object
Without early stopping it works fine.
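For reference, the trainer setup is roughly along these lines (a simplified sketch with placeholder values, not my exact main.py):

```python
# Simplified sketch of the setup; model and data_module are built earlier in main.py
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=1,                         # one GPU per Horovod process
    distributed_backend="horovod",  # launched with `horovodrun -np 4 python main.py`
    max_epochs=50,                  # placeholder value
    early_stop_callback=True,       # monitors the EvalResult's early_stop_on value
)
trainer.fit(model, data_module)
```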
### Environment
- CUDA:
  - GPU:
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.19.1
  - pyTorch_debug: False
  - pyTorch_version: 1.6.0
  - pytorch-lightning: 0.9.1rc1
  - tensorboard: 2.2.0
  - tqdm: 4.46.1
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.2
  - version: #1 SMP Fri Apr 20 16:44:24 UTC 2018
I also had this on lightning 0.9.0 - I actually upgraded to the rc hoping it would magically fix the problem.