🐛 Bug
Training of any model seems to hang indefinitely when running on a Google Compute Engine VM with a TPU. I've mainly been trying this example model, but I've also tried the LitAutoEncoder from this page.
Note that all unit tests pass, including the 8-core model training.
There seem to be 2 key areas that trigger a hang:
- Eval loop starting up. If I delay the eval loop with check_val_every_n_epoch=50 + max_epochs=60, the model will train all 50 epochs, but once the eval loop starts up, it will typically hang before finishing the 1st eval loop.
- Train loop finishing. If I avoid the eval loop (e.g. check_val_every_n_epoch=100 + max_epochs=50), the model will finish all 50 training epochs and then the process will hang. (Both Trainer configurations are sketched right after this list.)
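For concreteness, a minimal sketch of the two Trainer configurations described above; the tpu_cores=8 value is taken from the reproduction steps further down, and nothing else about the script is changed:
import pytorch_lightning as pl

# Scenario 1: delay validation -- the training epochs complete, but the run
# hangs once the first eval loop starts.
trainer_delayed_val = pl.Trainer(
    tpu_cores=8,
    max_epochs=60,
    check_val_every_n_epoch=50,
)

# Scenario 2: skip validation entirely -- all training epochs finish, then the
# process hangs instead of exiting.
trainer_no_val = pl.Trainer(
    tpu_cores=8,
    max_epochs=50,
    check_val_every_n_epoch=100,
)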
There seems to be something wrong with how the multiprocessing workers start or stop. Since this unit-test training works but real training hangs, I'm wondering if there is something important about @pl_multi_process_test that allows the tests to succeed. Maybe we need to add that functionality to the more general TPU training?
Let me know if there are other things I should try.
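I haven't dug into @pl_multi_process_test itself, so this is only a guess at why the tests behave differently: a wrapper of that kind typically runs the test body in its own process and joins it with a timeout, which would hide exactly the kind of teardown hang described above. A purely hypothetical sketch of that pattern (not the actual Lightning decorator):
import functools
import multiprocessing

def run_in_subprocess(timeout=60):
    """Hypothetical wrapper: run the decorated function in a separate process
    and give up after `timeout` seconds instead of hanging forever."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            proc = multiprocessing.Process(target=func, args=args, kwargs=kwargs)
            proc.start()
            proc.join(timeout)
            if proc.is_alive():
                proc.terminate()  # kill the stuck worker instead of blocking the caller
                raise TimeoutError(f"{func.__name__} did not finish within {timeout}s")
        return wrapper
    return decorator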
Description
Typically the hang looks something like this:
Epoch 0: 94%|████████████████████████████▏ | 45/48 [00:53<00:03, 1.19s/it, loss=0.947, v_num=2, train_loss=0.524]
Validating: 75%|█████████████████████████████████████████████████▌ | 12/16 [00:18<00:03, 1.13it/s]
(hangs...)
In other words, it gets stuck midway through an epoch.
When I kill the process I see:
Epoch 0: 88%|██████████████████████████▎ | 42/48 [00:18<00:02, 2.26it/s, loss=1.15, v_num=12, train_loss=0.185^CTraceback (most recent call last):███████████████████████████████████████████▉ | 15/16 [00:15<00:00, 1.99it/s]
File "computer_vision_fine_tuning.py", line 455, in <module>
main(get_args())
File "computer_vision_fine_tuning.py", line 437, in main
trainer.fit(model)
File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
results = self.accelerator_backend.train()
File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 114, in train
start_method=self.start_method
File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
start_method=start_method)
File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 77, in join
timeout=timeout,
File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
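One thing that might help pin down where the spawned workers are actually stuck (just an idea for diagnosis, not a fix): register Python's faulthandler in the script so a signal dumps every thread's stack on demand, instead of relying on the KeyboardInterrupt traceback above.
import faulthandler
import signal

# Put this near the top of computer_vision_fine_tuning.py. While the run is
# hanging, `kill -USR1 <pid>` prints the current Python stack of every thread
# in that process to stderr without terminating it.
faulthandler.register(signal.SIGUSR1, all_threads=True)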
Experiments so far:
- torch-xla-nightly + pip install pytorch-lightning: Hangs
- torch-xla-1.7 + pip install pytorch-lightning: Hangs
- torch-xla-1.7 + pip install 'pytorch-lightning==1.0.0': Hangs
- torch-xla-1.7 + pip install 'pytorch-lightning==0.9.0': Crashes using the same model as above (i.e. python computer_vision_fine_tuning.py). I also tried this starter model from the 0.9.0 documentation and it also crashes with Value out of range (expected to be in range of [-1, 0], but got 1)
- torch-xla-1.6 + pip install pytorch-lightning: Hangs
- pytorch nightly docker image + pip install pytorch-lightning: Hangs
To Reproduce
- Make a GCE VM using the PyTorch/XLA image
- conda activate torch-xla-1.7
- pip install pytorch-lightning
- git clone https://github.com/PyTorchLightning/pytorch-lightning.git
- cd pytorch-lightning/pl_examples/domain_templates
- vim computer_vision_fine_tuning.py
- add tpu_cores=8 to the Trainer and remove any GPU args (sketched below)
- make a TPU
- export TPU_IP_ADDRESS=<the TPU's IP>
- export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
- python computer_vision_fine_tuning.py
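For reference, the Trainer edit in the step above looks roughly like this; the arguments other than tpu_cores are placeholders, not the script's exact code:
import pytorch_lightning as pl

trainer = pl.Trainer(
    tpu_cores=8,      # added: run on all 8 TPU cores
    # gpus=1,         # removed: drop any GPU-specific arguments
    max_epochs=15,    # placeholder; keep whatever the script already uses
)
trainer.fit(model)    # model: the LightningModule the script builds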
Expected behavior
Training / eval completes and the process exits.
Please reproduce using the BoringModel
I can't make my Colab public due to work restrictions. The base BoringModel notebook fails for me anyway on the "Data" cell (a possible local workaround is sketched after the traceback below):
!pip install wandb
from pl_bolts.datasets import RandomDataset, DummyDataset, RandomDictDataset
ImportError Traceback (most recent call last)
<ipython-input-3-c3916d211d14> in <module>()
1 # some other options for random data
2 get_ipython().system('pip install wandb')
----> 3 from pl_bolts.datasets import RandomDataset, DummyDataset, RandomDictDataset
4 frames
/usr/local/lib/python3.6/dist-packages/pl_bolts/datasets/imagenet_dataset.py in <module>()
10 import numpy as np
11 import torch
---> 12 from torch._six import PY3
13
14 from pl_bolts.utils import _TORCHVISION_AVAILABLE
ImportError: cannot import name 'PY3'
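A possible workaround for the broken pl_bolts import, assuming the notebook only needs small random datasets: define the dataset locally instead of importing it. This is my own stand-in, not a fix for pl_bolts:
import torch
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):
    """Local stand-in for pl_bolts.datasets.RandomDataset: fixed-size random tensors."""
    def __init__(self, size: int, length: int):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

train = DataLoader(RandomDataset(32, 64), batch_size=2)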
Environment
You can get the script and run it with:
wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
- CUDA:
  - GPU:
  - available: False
  - version: 10.2
- Packages:
  - numpy: 1.19.2
  - pyTorch_debug: True
  - pyTorch_version: 1.7.0
  - pytorch-lightning: 1.1.7
  - tqdm: 4.56.0
- System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor:
  - python: 3.6.10
  - version: #1 SMP Debian 4.9.246-2 (2020-12-17)