Hanging with TPUs on GCE VM #5841

@zcain117

🐛 Bug

Training of any model seems to hang indefinitely when running on a Google Compute Engine VM with TPUs.

Mainly I've been trying this example model, but I've also tried the LitAutoEncoder from this page.

Note that all unit tests pass, including the 8-core model training.

There seem to be 2 key areas that trigger a hang:

  1. The eval loop starting up. If I delay the eval loop with check_val_every_n_epoch=50 + max_epochs=60, the model will train all 50 epochs, but once the eval loop starts up it will typically hang before finishing the first eval loop.
  2. The train loop finishing. If I avoid the eval loop entirely (e.g. check_val_every_n_epoch=100 + max_epochs=50), the model will finish all 50 training epochs and then the process will hang instead of exiting. (Both configurations are sketched just below this list.)
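
For concreteness, here is a minimal sketch of the two Trainer configurations described above. The model and data are the unmodified example script; only the flags change, and the flag names are the standard Lightning Trainer arguments:

import pytorch_lightning as pl

# Case 1: delay validation. Training runs for the first 50 epochs,
# then hangs shortly after the first eval loop starts.
trainer_delayed_val = pl.Trainer(tpu_cores=8, check_val_every_n_epoch=50, max_epochs=60)

# Case 2: skip validation entirely. All 50 training epochs finish,
# then the process hangs instead of exiting.
trainer_no_val = pl.Trainer(tpu_cores=8, check_val_every_n_epoch=100, max_epochs=50)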

There seems to be something wrong with how the worker processes start or stop. Since this unit test's training works but real training hangs, I'm wondering if there is something important about @pl_multi_process_test that allows the tests to succeed (my guess at what it does is sketched below). Maybe we need to add that functionality to the more general TPU training?
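
I don't know exactly what @pl_multi_process_test does internally, but my working assumption is that it runs the test body in a child process and enforces a timeout, so a hung TPU teardown can't block the suite forever. Roughly something like this hypothetical helper (the name run_with_timeout and all details here are mine, not Lightning's):

import multiprocessing

def run_with_timeout(fn, timeout=100):
    # Run fn in a child process; if it has not exited after `timeout` seconds,
    # assume it is hung and terminate it instead of waiting forever.
    proc = multiprocessing.Process(target=fn)
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()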

Let me know if there are other things I should try.

Description

Typically the hang looks something like this:

Epoch 0:  94%|████████████████████████████▏ | 45/48 [00:53<00:03,  1.19s/it, loss=0.947, v_num=2, train_loss=0.524]
Validating:  75%|█████████████████████████████████████████████████▌                | 12/16 [00:18<00:03,  1.13it/s]
(hangs...)

In other words, it gets stuck midway through an epoch.

When I kill the process I see:

Epoch 0:  88%|██████████████████████████▎   | 42/48 [00:18<00:02,  2.26it/s, loss=1.15, v_num=12, train_loss=0.185^CTraceback (most recent call last):███████████████████████████████████████████▉    | 15/16 [00:15<00:00,  1.99it/s]
  File "computer_vision_fine_tuning.py", line 455, in <module>
    main(get_args())
  File "computer_vision_fine_tuning.py", line 437, in main
    trainer.fit(model)
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 114, in train
    start_method=self.start_method
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 77, in join
    timeout=timeout,
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/anaconda3/envs/torch-xla-1.7-1/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
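
For context, at this point the parent process is just sitting inside xmp.spawn waiting for the eight child processes to exit. The call that never returns looks roughly like the sketch below (my simplification, not Lightning's actual code; nprocs and start_method are what I assume the TPU accelerator passes):

import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Lightning's per-core training loop runs here in each child process.
    pass

# The parent blocks in join() until all children exit; this is where the
# KeyboardInterrupt from the traceback above lands when I kill the hung run.
xmp.spawn(_mp_fn, nprocs=8, start_method='fork')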

Experiments so far:

  • torch-xla-nightly + pip install pytorch-lightning: Hangs
  • torch-xla-1.7 + pip install pytorch-lightning: Hangs
  • torch-xla-1.7 + pip install 'pytorch-lightning==1.0.0': Hangs
  • torch-xla-1.7 + pip install 'pytorch-lightning==0.9.0': Crashes using the same model as above (i.e. python computer_vision_fine_tuning.py). I also tried this starter model from the 0.9.0 documentation and it also crashes, with Value out of range (expected to be in range of [-1, 0], but got 1)
  • torch-xla-1.6 + pip install pytorch-lightning: Hangs
  • pytorch nightly docker image + pip install pytorch-lightning: Hangs

To Reproduce

  1. Make a GCE VM using the PyTorch/XLA image
  2. conda activate torch-xla-1.7
  3. pip install pytorch-lightning
  4. git clone https://github.com/PyTorchLightning/pytorch-lightning.git
  5. cd pytorch-lightning/pl_examples/domain_templates
  6. vim computer_vision_fine_tuning.py
  7. add tpu_cores=8 to the Trainer and remove any GPU args (see the sketch after these steps)
  8. make a TPU
  9. export TPU_IP_ADDRESS=<the TPU's IP>
  10. export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
  11. python computer_vision_fine_tuning.py
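
For step 7, the change to computer_vision_fine_tuning.py is roughly the following (a sketch; the keyword arguments the script already passes may differ, the only point is that tpu_cores=8 goes in and the GPU argument comes out):

import pytorch_lightning as pl

# before: trainer = pl.Trainer(gpus=args.gpus, max_epochs=..., ...)
# after: request all 8 TPU cores and drop the GPU argument
trainer = pl.Trainer(
    tpu_cores=8,    # added
    max_epochs=50,  # whatever the script already uses, unchanged
)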

Expected behavior

Training / eval completes and the process exits.

Please reproduce using the BoringModel

I don't think I can make my Colab public due to work restrictions. The base BoringModel notebook fails for me anyway on the "Data" cell (a possible workaround is sketched below the traceback):

!pip install wandb
from pl_bolts.datasets import RandomDataset, DummyDataset, RandomDictDataset

ImportError                               Traceback (most recent call last)
<ipython-input-3-c3916d211d14> in <module>()
      1 # some other options for random data
      2 get_ipython().system('pip install wandb')
----> 3 from pl_bolts.datasets import RandomDataset, DummyDataset, RandomDictDataset

4 frames
/usr/local/lib/python3.6/dist-packages/pl_bolts/datasets/imagenet_dataset.py in <module>()
     10 import numpy as np
     11 import torch
---> 12 from torch._six import PY3
     13 
     14 from pl_bolts.utils import _TORCHVISION_AVAILABLE

ImportError: cannot import name 'PY3'
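
If anyone needs to sidestep that import, a local stand-in for RandomDataset should be enough for the BoringModel, since it only needs a dataset of random tensors. A sketch based on my reading of what the BoringModel uses, not an official replacement:

import torch
from torch.utils.data import Dataset

class RandomDataset(Dataset):
    # Minimal random-tensor dataset: `length` samples, each of dimension `size`.
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len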

Environment

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • CUDA:
    - GPU:
    - available: False
    - version: 10.2
  • Packages:
    - numpy: 1.19.2
    - pyTorch_debug: True
    - pyTorch_version: 1.7.0
    - pytorch-lightning: 1.1.7
    - tqdm: 4.56.0
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor:
    - python: 3.6.10
    - version: #1 SMP Debian 4.9.246-2 (2020-12-17)

Labels

accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)
