Description
🐛 Bug
To Reproduce
- Run in a GCP instance group of size 4 + a TPU v2-32.
- Add tpu_cores=8 to the boringmodel (the diff below also sets precision=16):
(py36) jiasen@instance-group-1-cntd:~$ diff bug_report_model.py the_boringmodel.py
148a149,150
> precision=16,
> tpu_cores=8,
Command to run:
python -m torch_xla.distributed.xla_dist --tpu=pod --docker-image=gcr.io/tpu-pytorch/xla:r1.8 \
--docker-run-flag=--rm=true \
--docker-run-flag=--shm-size=16GB \
--docker-run-flag=-v \
--docker-run-flag=/home/jiasen:/app \
--docker-run-flag=-w \
--docker-run-flag=/app \
--env=XLA_USE_BF16=1 \
-- bash -c "pip install pytorch_lightning && python the_boringmodel.py"
The exception occurs immediately after pytorch_lightning is installed. It is raised on every instance, so the output repeats; only one copy is shown here.
2021-03-26 20:38:10 10.164.0.42 [2] Traceback (most recent call last):
2021-03-26 20:38:10 10.164.0.42 [2] File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 31, in inner_f
2021-03-26 20:38:10 10.164.0.42 [2] queue.put(func(*args, **kwargs))
2021-03-26 20:38:10 10.164.0.42 [2] File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 83, in _is_device_tpu
2021-03-26 20:38:10 10.164.0.42 [2] device = xm.xla_device()
2021-03-26 20:38:10 10.164.0.42 [2] File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 231, in xla_device
2021-03-26 20:38:10 10.164.0.42 [2] devkind=devkind if devkind is not None else None)
2021-03-26 20:38:10 10.164.0.42 [2] File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 136, in get_xla_supported_devices
2021-03-26 20:38:10 10.164.0.42 [2] xla_devices = _DEVICES.value
2021-03-26 20:38:10 10.164.0.42 [2] File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/utils/utils.py", line 32, in value
2021-03-26 20:38:10 10.164.0.42 [2] self._value = self._gen_fn()
2021-03-26 20:38:10 10.164.0.42 [2] File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 18, in <lambda>
2021-03-26 20:38:10 10.164.0.42 [2] _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
2021-03-26 20:38:10 10.164.0.42 [2] RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:258 : Check failed: default_device_target != options_.global_device_map.end()
2021-03-26 20:38:10 10.164.0.42 [2] *** Begin stack trace ***
2021-03-26 20:38:10 10.164.0.42 [2] tensorflow::CurrentStackTrace()
2021-03-26 20:38:10 10.164.0.42 [2] xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >, xla::XrtLocalService*)
2021-03-26 20:38:10 10.164.0.42 [2] xla::ComputationClient::Create()
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] xla::ComputationClient::Get()
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyCFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyObject_GenericGetAttrWithDict
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] _PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] _PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] _PyObject_Call_Prepend
2021-03-26 20:38:10 10.164.0.42 [2] PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] _PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] _PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] _PyObject_CallMethodIdObjArgs
2021-03-26 20:38:10 10.164.0.42 [2] PyImport_ImportModuleLevelObject
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] _PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] _PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] _PyObject_CallMethodIdObjArgs
2021-03-26 20:38:10 10.164.0.42 [2] PyImport_ImportModuleLevelObject
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] _PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] *** End stack trace ***
2021-03-26 20:38:10 10.164.0.42 [2]
0it [00:00, ?it/s]0 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /root/anaconda3/envs/pytorch/lib/python3.6/site-packages/Datasets/MNIST/raw/train-images-idx3-ubyte.gz
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] ####
2021-03-26 20:38:10 10.164.0.42 [2] ###########
2021-03-26 20:38:10 10.164.0.42 [2] ####################
2021-03-26 20:38:10 10.164.0.42 [2] ############################
2021-03-26 20:38:10 10.164.0.42 [2] #####################################
2021-03-26 20:38:10 10.164.0.42 [2] ##############################################
2021-03-26 20:38:10 10.164.0.42 [2] ######################### ###################
2021-03-26 20:38:10 10.164.0.42 [2] ####################### ###################
2021-03-26 20:38:10 10.164.0.42 [2] #################### ####################
2021-03-26 20:38:10 10.164.0.42 [2] ################## #####################
2021-03-26 20:38:10 10.164.0.42 [2] ################ ######################
2021-03-26 20:38:10 10.164.0.42 [2] ##################### #################
2021-03-26 20:38:10 10.164.0.42 [2] Traceback (most recent call last):
2021-03-26 20:38:10 10.164.0.42 [2] ###################### ###################
2021-03-26 20:38:10 10.164.0.42 [2] File "the_boringmodel.py", line 153, in <module>
2021-03-26 20:38:10 10.164.0.42 [2] ##################### #####################
2021-03-26 20:38:10 10.164.0.42 [2] test_run()
2021-03-26 20:38:10 10.164.0.42 [2] #################### #######################
2021-03-26 20:38:10 10.164.0.42 [2] File "the_boringmodel.py", line 145, in test_run
2021-03-26 20:38:10 10.164.0.42 [2] ################### #########################
2021-03-26 20:38:10 10.164.0.42 [2] tpu_cores=8,
2021-03-26 20:38:10 10.164.0.42 [2] ##############################################
2021-03-26 20:38:10 10.164.0.42 [2] File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
2021-03-26 20:38:10 10.164.0.42 [2] #####################################
2021-03-26 20:38:10 10.164.0.42 [2] return fn(self, **kwargs)
2021-03-26 20:38:10 10.164.0.42 [2] ############################
2021-03-26 20:38:10 10.164.0.42 [2] File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 321, in __init__
2021-03-26 20:38:10 10.164.0.42 [2] ####################
2021-03-26 20:38:10 10.164.0.42 [2] replace_sampler_ddp, deterministic, precision, amp_backend, amp_level, plugins
2021-03-26 20:38:10 10.164.0.42 [2] ##########
2021-03-26 20:38:10 10.164.0.42 [2] File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 91, in __init__
2021-03-26 20:38:10 10.164.0.42 [2] ####
2021-03-26 20:38:10 10.164.0.42 [2] self.tpu_cores = device_parser.parse_tpu_cores(tpu_cores)
2021-03-26 20:38:10 10.164.0.42 [2]
2021-03-26 20:38:10 10.164.0.42 [2] File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 113, in parse_tpu_cores
2021-03-26 20:38:10 10.164.0.42 [2] raise MisconfigurationException('No TPU devices were found.')
2021-03-26 20:38:10 10.164.0.42 [2] pytorch_lightning.utilities.exceptions.MisconfigurationException: No TPU devices were found.
Expected behavior
The TPU is definitely available, so it should be detected.
Environment
IDE: Please, use our python bug_report_model.py template.
- PyTorch Version: 1.2.5
- OS: Ubuntu 20.04.2 LTS
- Python version: 3.6
- docker image: gcr.io/tpu-pytorch/xla:r1.8
- xla: r1.8
- How you installed PyTorch: provided in gcr.io/tpu-pytorch/xla:r1.8
Additional context
As a simple workaround, I set _TPU_AVAILABLE = True in https://github.com/PyTorchLightning/pytorch-lightning/blob/0e45220263f4e2045dfe7f68e3e0eaac0b2033d5/pytorch_lightning/utilities/__init__.py#L52, and it works: no more exceptions, and the model trains without problems.
I think the TPU detection logic in a pod environment is wrong or outdated with respect to the current xla (note that detection works with a single TPU device). The official xla code uses xmp.spawn to spawn a process that obtains the TPU device.
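To illustrate why a failed device query ends up as "No TPU devices were found.", here is a simplified sketch of the probe-in-a-child-process pattern that the traceback hints at (inner_f putting the function result on a queue). This is not Lightning's actual code: the names probe_in_subprocess and _child are made up for illustration, and the "fork" start method assumes Linux.

```python
import multiprocessing as mp
import queue as queue_module
import traceback


def _child(func, result_queue):
    """Run the probe in the child and report its result (None on failure)."""
    try:
        result_queue.put(func())
    except Exception:
        # The real error (e.g. the XLA "Check failed") is only printed here.
        traceback.print_exc()
        result_queue.put(None)


def probe_in_subprocess(func, timeout=25.0):
    """Hypothetical helper: return True only if func() succeeds and is truthy."""
    ctx = mp.get_context("fork")  # assumption: Linux host
    result_queue = ctx.Queue()
    proc = ctx.Process(target=_child, args=(func, result_queue))
    proc.start()
    try:
        result = result_queue.get(timeout=timeout)
    except queue_module.Empty:
        result = None  # a hang is treated the same as a failure
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
    return bool(result)
```

With a pattern like this, any exception in the probe collapses into False, so the caller cannot tell "no TPU present" apart from "the device query itself is broken in this environment".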
Besides, I think most top-level checks of _TPU_AVAILABLE that only guard XLA imports can be replaced by checks of _XLA_AVAILABLE.