No TPU devices were found in a TPU pod env. #6692

@jiasenwu

🐛 Bug

To Reproduce

  • Run in a GCP instance group of size 4 with a TPU v2-32.
  • Add precision=16 and tpu_cores=8 to the Trainer in the BoringModel (diff below).
(py36) jiasen@instance-group-1-cntd:~$ diff bug_report_model.py the_boringmodel.py 
148a149,150
>         precision=16,
>         tpu_cores=8,

Command to run:

python -m torch_xla.distributed.xla_dist --tpu=pod --docker-image=gcr.io/tpu-pytorch/xla:r1.8 \
    --docker-run-flag=--rm=true \
    --docker-run-flag=--shm-size=16GB \
    --docker-run-flag=-v \
    --docker-run-flag=/home/jiasen:/app \
    --docker-run-flag=-w \
    --docker-run-flag=/app \
    --env=XLA_USE_BF16=1 \
    -- bash -c "pip install pytorch_lightning && python the_boringmodel.py"

The exception occurs immediately after pytorch_lightning is installed. It is repeated in the log because it happens on every instance; I have copied only one occurrence here.

2021-03-26 20:38:10 10.164.0.42 [2] Traceback (most recent call last):
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 31, in inner_f
2021-03-26 20:38:10 10.164.0.42 [2]     queue.put(func(*args, **kwargs))
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/xla_device.py", line 83, in _is_device_tpu
2021-03-26 20:38:10 10.164.0.42 [2]     device = xm.xla_device()
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 231, in xla_device
2021-03-26 20:38:10 10.164.0.42 [2]     devkind=devkind if devkind is not None else None)
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 136, in get_xla_supported_devices
2021-03-26 20:38:10 10.164.0.42 [2]     xla_devices = _DEVICES.value
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/utils/utils.py", line 32, in value
2021-03-26 20:38:10 10.164.0.42 [2]     self._value = self._gen_fn()
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 18, in <lambda>
2021-03-26 20:38:10 10.164.0.42 [2]     _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
2021-03-26 20:38:10 10.164.0.42 [2] RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:258 : Check failed: default_device_target != options_.global_device_map.end() 
2021-03-26 20:38:10 10.164.0.42 [2] *** Begin stack trace ***
2021-03-26 20:38:10 10.164.0.42 [2] 	tensorflow::CurrentStackTrace()
2021-03-26 20:38:10 10.164.0.42 [2] 	xla::XrtComputationClient::XrtComputationClient(xla::XrtComputationClient::Options, std::unique_ptr<tensorflow::tpu::TopologyProto, std::default_delete<tensorflow::tpu::TopologyProto> >, xla::XrtLocalService*)
2021-03-26 20:38:10 10.164.0.42 [2] 	xla::ComputationClient::Create()
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	xla::ComputationClient::Get()
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyCFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_GenericGetAttrWithDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_Call_Prepend
2021-03-26 20:38:10 10.164.0.42 [2] 	PyObject_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_CallMethodIdObjArgs
2021-03-26 20:38:10 10.164.0.42 [2] 	PyImport_ImportModuleLevelObject
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyFunction_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_FastCallDict
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyObject_CallMethodIdObjArgs
2021-03-26 20:38:10 10.164.0.42 [2] 	PyImport_ImportModuleLevelObject
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCodeEx
2021-03-26 20:38:10 10.164.0.42 [2] 	PyEval_EvalCode
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	PyCFunction_Call
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] 	_PyEval_EvalFrameDefault
2021-03-26 20:38:10 10.164.0.42 [2] 	
2021-03-26 20:38:10 10.164.0.42 [2] *** End stack trace ***
2021-03-26 20:38:10 10.164.0.42 [2] 
0it [00:00, ?it/s]0 10.164.0.42 [2] 
2021-03-26 20:38:10 10.164.0.42 [2] Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /root/anaconda3/envs/pytorch/lib/python3.6/site-packages/Datasets/MNIST/raw/train-images-idx3-ubyte.gz
2021-03-26 20:38:10 10.164.0.42 [2] 
2021-03-26 20:38:10 10.164.0.42 [2] 
2021-03-26 20:38:10 10.164.0.42 [2]                     ####
2021-03-26 20:38:10 10.164.0.42 [2]                 ###########
2021-03-26 20:38:10 10.164.0.42 [2]              ####################
2021-03-26 20:38:10 10.164.0.42 [2]          ############################
2021-03-26 20:38:10 10.164.0.42 [2]     #####################################
2021-03-26 20:38:10 10.164.0.42 [2] ##############################################
2021-03-26 20:38:10 10.164.0.42 [2] #########################  ###################
2021-03-26 20:38:10 10.164.0.42 [2] #######################    ###################
2021-03-26 20:38:10 10.164.0.42 [2] ####################      ####################
2021-03-26 20:38:10 10.164.0.42 [2] ##################       #####################
2021-03-26 20:38:10 10.164.0.42 [2] ################        ######################
2021-03-26 20:38:10 10.164.0.42 [2] #####################        #################
2021-03-26 20:38:10 10.164.0.42 [2] Traceback (most recent call last):
2021-03-26 20:38:10 10.164.0.42 [2] ######################     ###################
2021-03-26 20:38:10 10.164.0.42 [2]   File "the_boringmodel.py", line 153, in <module>
2021-03-26 20:38:10 10.164.0.42 [2] #####################    #####################
2021-03-26 20:38:10 10.164.0.42 [2]     test_run()
2021-03-26 20:38:10 10.164.0.42 [2] ####################   #######################
2021-03-26 20:38:10 10.164.0.42 [2]   File "the_boringmodel.py", line 145, in test_run
2021-03-26 20:38:10 10.164.0.42 [2] ###################  #########################
2021-03-26 20:38:10 10.164.0.42 [2]     tpu_cores=8,
2021-03-26 20:38:10 10.164.0.42 [2] ##############################################
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
2021-03-26 20:38:10 10.164.0.42 [2]     #####################################
2021-03-26 20:38:10 10.164.0.42 [2]     return fn(self, **kwargs)
2021-03-26 20:38:10 10.164.0.42 [2]          ############################
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 321, in __init__
2021-03-26 20:38:10 10.164.0.42 [2]              ####################
2021-03-26 20:38:10 10.164.0.42 [2]     replace_sampler_ddp, deterministic, precision, amp_backend, amp_level, plugins
2021-03-26 20:38:10 10.164.0.42 [2]                   ##########
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 91, in __init__
2021-03-26 20:38:10 10.164.0.42 [2]                      ####
2021-03-26 20:38:10 10.164.0.42 [2]     self.tpu_cores = device_parser.parse_tpu_cores(tpu_cores)
2021-03-26 20:38:10 10.164.0.42 [2] 
2021-03-26 20:38:10 10.164.0.42 [2]   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 113, in parse_tpu_cores
2021-03-26 20:38:10 10.164.0.42 [2]     raise MisconfigurationException('No TPU devices were found.')
2021-03-26 20:38:10 10.164.0.42 [2] pytorch_lightning.utilities.exceptions.MisconfigurationException: No TPU devices were found.

Expected behavior

The TPU is definitely available, so it should be detected.

Environment

  • PyTorch Lightning Version: 1.2.5
  • OS: Ubuntu 20.04.2 LTS
  • Python version: 3.6
  • docker image: gcr.io/tpu-pytorch/xla:r1.8
  • xla: r1.8
  • How you installed PyTorch: provided in gcr.io/tpu-pytorch/xla:r1.8

Additional context

As a simple workaround, I set _TPU_AVAILABLE = True in https://github.com/PyTorchLightning/pytorch-lightning/blob/0e45220263f4e2045dfe7f68e3e0eaac0b2033d5/pytorch_lightning/utilities/__init__.py#L52. With that change there are no more exceptions and the model trains perfectly.

I think the TPU detection logic is wrong or outdated with respect to the current XLA release when running in a pod environment (note that it works with a single TPU device). I see that the official XLA code uses xmp.spawn to spawn a process that obtains the TPU device.
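A minimal sketch of an out-of-process device probe of the kind discussed here, using only the standard library. It is not Lightning's or torch_xla's implementation: `probe_in_subprocess`, `ProbeResult`, and `XLA_PROBE` are hypothetical names, and the only real XLA call is `xm.xla_device()`, taken from the traceback in this report. The sketch keeps the child's error text rather than collapsing every failure into "no TPU", so a failing probe can be told apart from a missing device.

```python
import subprocess
import sys
from typing import NamedTuple


class ProbeResult(NamedTuple):
    ok: bool      # True only if the child interpreter exited with status 0
    detail: str   # last stderr line of the child, or a timeout note


# Hypothetical probe snippet: the child exits with status 0 only if
# xm.xla_device() returns without raising.
XLA_PROBE = "import torch_xla.core.xla_model as xm; xm.xla_device()"


def probe_in_subprocess(snippet: str, timeout: float = 25.0) -> ProbeResult:
    """Run `snippet` in a fresh interpreter so a crash or hang stays in the child."""
    try:
        child = subprocess.run(
            [sys.executable, "-c", snippet],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # decode output as text (works on Python 3.6)
            timeout=timeout,          # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return ProbeResult(False, f"probe timed out after {timeout} s")
    stderr_lines = child.stderr.strip().splitlines()
    detail = stderr_lines[-1] if stderr_lines else ""
    return ProbeResult(child.returncode == 0, detail)
```

If the snippet raises in the child, `probe_in_subprocess(XLA_PROBE)` returns `ok=False` together with the last line of the child's error output, which the caller can log or surface instead of a generic message.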

Besides, I think most top-level checks of _TPU_AVAILABLE that only guard the XLA imports can be replaced by checks of _XLA_AVAILABLE.

Labels

  • accelerator: tpu (Tensor Processing Unit)
  • bug (Something isn't working)
  • help wanted (Open to be worked on)
  • priority: 0 (High priority task)
