
Logging issue on TPU VM Pod  #7912

@tgisaturday

🐛 Bug

Please reproduce using the BoringModel

I converted BoringModel.ipynb to a .py script and added tpu_cores=8 to the Trainer.
The script runs successfully on Google Cloud TPU VM Pod v3-8,
but the process crashes on Google Cloud TPU VM Pod v3-32 (not a Pod Node).

To Reproduce

Convert BoringModel.ipynb to a .py script, add tpu_cores=8 to the Trainer (for TPU support), and run the script on a v3-32.

Expected behavior

The script should run without crashing on v3-32.

Environment

Note: Bugs with code are solved faster! The Colab notebook should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py

TPU VM Pod Software: v2-alpha

  • PyTorch Version (e.g., 1.0): 1.8.1
  • OS (e.g., Linux): Ubuntu
  • How you installed PyTorch (conda, pip, source): built-in image in v2-alpha

Additional context

I have also been testing simple MNIST GAN code, and the same problem appears. My custom code crashes when Trainer.fit() automatically tries to save checkpoints with trainer.save_checkpoint.
Here is the test code that I used:
testcode.zip

Labels

bug, help wanted
