Hanging with TPUs on GCE VM #5841

@zcain117

🐛 Bug

Training of any model seems to hang indefinitely when running on a Google Compute Engine VM with TPUs.

Mainly I've been trying this example model, but I've also tried the LitAutoEncoder from this page.

Note that all unit tests pass, including the 8-core model training.

There seem to be 2 key areas that trigger a hang:

  1. The eval loop starting up. If I delay the eval loop with check_val_every_n_epoch=50 + max_epochs=60, the model will train all 50 epochs, but once the eval loop starts up it will typically hang before finishing the first eval loop.
  2. The train loop finishing. If I avoid the eval loop entirely (e.g. check_val_every_n_epoch=100 + max_epochs=50), the model will finish all 50 training epochs and then the process will hang instead of exiting. (Both configurations are sketched just below this list.)
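
For concreteness, here is a minimal sketch of the two Trainer configurations described above. The model and data are the unmodified example script; only the flags change, and the flag names are the standard Lightning Trainer arguments:

import pytorch_lightning as pl

# Case 1: delay validation. Training runs for the first 50 epochs,
# then hangs shortly after the first eval loop starts.
trainer_delayed_val = pl.Trainer(tpu_cores=8, check_val_every_n_epoch=50, max_epochs=60)

# Case 2: skip validation entirely. All 50 training epochs finish,
# then the process hangs instead of exiting.
trainer_no_val = pl.Trainer(tpu_cores=8, check_val_every_n_epoch=100, max_epochs=50)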

There seems to be something wrong with how the worker processes start or stop. Since this unit test's training works but real training hangs, I'm wondering if there is something important about @pl_multi_process_test that allows the tests to succeed (my guess at what it does is sketched below). Maybe we need to add that functionality to the more general TPU training?
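
I don't know exactly what @pl_multi_process_test does internally, but my working assumption is that it runs the test body in a child process and enforces a timeout, so a hung TPU teardown can't block the suite forever. Roughly something like this hypothetical helper (the name run_with_timeout and all details here are mine, not Lightning's):

import multiprocessing

def run_with_timeout(fn, timeout=100):
    # Run fn in a child process; if it has not exited after `timeout` seconds,
    # assume it is hung and terminate it instead of waiting forever.
    proc = multiprocessing.Process(target=fn)
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()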

Let me know if there are other things I should try.

Description

Typically the hang looks something like this:

Epoch 0:  94%|████████████████████████████▏ | 45/48 [00:53<00:03,  1.19s/it, loss=0.947, v_num=2, train_loss=0.524]
Validating:  75%|█████████████████████████████████████████████████▌                | 12/16 [00:18<00:03,  1.13it/s]
(hangs...)

In other words, it gets stuck midway through an epoch.

When I kill the process I see:

Epoch 0:  88%|██████████████████████████▎   | 42/48 [00:18<00:02,  2.26it/s, loss=1.15, v_num=12, train_loss=0.185^CTraceback (most recent call last):███████████████████████████████████████████▉    | 15/16 [00:15<00:00,  1.99it/s]
  File "computer_vision_fine_tuning.py", line 455, in <module>
    main(get_args())
  File "computer_vision_fine_tuning.py", line 437, in main
    trainer.fit(model)
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 114, in train
    start_method=self.start_method
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 77, in join
    timeout=timeout,
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
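
For context, at this point the parent process is just sitting inside xmp.spawn waiting for the eight child processes to exit. The call that never returns looks roughly like the sketch below (my simplification, not Lightning's actual code; nprocs and start_method are what I assume the TPU accelerator passes):

import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Lightning's per-core training loop runs here in each child process.
    pass

# The parent blocks in join() until all children exit; this is where the
# KeyboardInterrupt from the traceback above lands when I kill the hung run.
xmp.spawn(_mp_fn, nprocs=8, start_method='fork')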

Experiments so far:

  • torch-xla-nightly + pip install pytorch-lightning: Hangs
  • torch-xla-1.7 + pip install pytorch-lightning: Hangs
  • torch-xla-1.7 + pip install 'pytorch-lightning==1.0.0': Hangs
  • torch-xla-1.7 + pip install 'pytorch-lightning==0.9.0': Crashes using the same model as above (i.e. python computer_vision_fine_tuning.py). I also tried this starter model from the 0.9.0 documentation and it also crashes, with Value out of range (expected to be in range of [-1, 0], but got 1)
  • torch-xla-1.6 + pip install pytorch-lightning: Hangs
  • pytorch nightly docker image + pip install pytorch-lightning: Hangs

To Reproduce

  1. Make a GCE VM using the PyTorch/XLA image
  2. conda activate torch-xla-1.7
  3. pip install pytorch-lightning
  4. git clone https://github.com/PyTorchLightning/pytorch-lightning.git
  5. cd pytorch-lightning/pl_examples/domain_templates
  6. vim computer_vision_fine_tuning.py
  7. add tpu_cores=8 to the Trainer and remove any GPU args (see the sketch after these steps)
  8. make a TPU
  9. export TPU_IP_ADDRESS=<the TPU's IP>
  10. export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
  11. python computer_vision_fine_tuning.py
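
For step 7, the change to computer_vision_fine_tuning.py is roughly the following (a sketch; the keyword arguments the script already passes may differ, the only point is that tpu_cores=8 goes in and the GPU argument comes out):

import pytorch_lightning as pl

# before: trainer = pl.Trainer(gpus=args.gpus, max_epochs=..., ...)
# after: request all 8 TPU cores and drop the GPU argument
trainer = pl.Trainer(
    tpu_cores=8,    # added
    max_epochs=50,  # whatever the script already uses, unchanged
)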

Expected behavior

Training / eval completes and the process exits.

Please reproduce using the BoringModel

I don't think I can make my Colab public due to work restrictions. The base BoringModel notebook fails for me anyway on the "Data" cell (a possible workaround is sketched below the traceback):

!pip install wandb
from pl_bolts.datasets import RandomDataset, DummyDataset, RandomDictDataset

ImportError                               Traceback (most recent call last)
<ipython-input-3-c3916d211d14> in <module>()
      1 # some other options for random data
      2 get_ipython().system('pip install wandb')
----> 3 from pl_bolts.datasets import RandomDataset, DummyDataset, RandomDictDataset

4 frames
/usr/local/lib/python3.6/dist-packages/pl_bolts/datasets/imagenet_dataset.py in <module>()
     10 import numpy as np
     11 import torch
---> 12 from torch._six import PY3
     13 
     14 from pl_bolts.utils import _TORCHVISION_AVAILABLE

ImportError: cannot import name 'PY3'
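
If anyone needs to sidestep that import, a local stand-in for RandomDataset should be enough for the BoringModel, since it only needs a dataset of random tensors. A sketch based on my reading of what the BoringModel uses, not an official replacement:

import torch
from torch.utils.data import Dataset

class RandomDataset(Dataset):
    # Minimal random-tensor dataset: `length` samples, each of dimension `size`.
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len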

Environment

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • CUDA:
    - GPU:
    - available: False
    - version: 10.2
  • Packages:
    - numpy: 1.19.2
    - pyTorch_debug: True
    - pyTorch_version: 1.7.0
    - pytorch-lightning: 1.1.7
    - tqdm: 4.56.0
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor:
    - python: 3.6.10
    - version: #1 SMP Debian 4.9.246-2 (2020-12-17)

Labels

accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)
