Description
🐛 Bug
Please reproduce using the BoringModel
I converted BoringModel.ipynb to a .py script and added tpu_cores=8 to the Trainer.
The script runs successfully on a Google Cloud TPU VM Pod v3-8,
but the process crashes on a Google Cloud TPU VM Pod v3-32 (not a Pod Node).
To Reproduce
Convert BoringModel.ipynb to a .py script, add tpu_cores=8 to the Trainer (for TPU support), and run it on a TPU VM Pod v3-32.
Expected behavior
The script should run without crashing on v3-32.
Environment
Note: Bugs with code are solved faster! Colab Notebook should be made public!
- IDE: Please, use our python bug_report_model.py template.
- Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
- TPU VM Pod Software: v2-alpha
- PyTorch Version (e.g., 1.0): 1.8.1
- OS (e.g., Linux): Ubuntu
- How you installed PyTorch (conda, pip, source): built-in image in v2-alpha
Additional context
I have also tested a simple MNIST GAN script, and the same problem appears. My custom code crashes when Trainer.fit() automatically tries to save a checkpoint via trainer.save_checkpoint.
Here is the test code I used:
testcode.zip