I want to use the ResNet-101-v2 feature vectors for transfer learning. I am training with the Estimator API on GCP, and I call the hub Module at the beginning of the model_fn:
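import tensorflow as tf
import tensorflow_hub as hub

# Inside model_fn: load the module and compute feature vectors for the input images.
module_url = "https://tfhub.dev/google/imagenet/resnet_v2_101/feature_vector/1"
module = hub.Module(module_url)
height, width = hub.get_expected_image_size(module)
images = tf.image.resize_images(input_tensor, [height, width])
feature_vectors = module(images)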
When I run on a single node ("basic-gpu") all is well. However, when I run the same code in distributed mode ("standard-1"), I get this error:
The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 337, in _set_variable_or_list_initializer
    _set_checkpoint_initializer(variable_or_list, ckpt_file, tensor_name, "")
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 299, in _set_checkpoint_initializer
    ckpt_file, [tensor_name], [slice_spec], [base_type], name=name)[0]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1458, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on /tmp/tfhub_modules/e0c607f95a3d67bc8928a5c20d09d1915322cfcb/variables/variables: Not found: /tmp/tfhub_modules/e0c607f95a3d67bc8928a5c20d09d1915322cfcb/variables; No such file or directory
  [[Node: checkpoint_initializer_537 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:1/device:CPU:0"](checkpoint_initializer_537/prefix, checkpoint_initializer_537/tensor_names, checkpoint_initializer_537/shape_and_slices)]]
  [[Node: init/NoOp_3_S22 = _Recvclient_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:1/device:CPU:0", send_device_incarnation=-7983147897712139617, tensor_name="edge_3296_init/NoOp_3", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]]

To find out more about why your job exited please check the logs: ....
How should I structure my code for TF Hub to work with the Estimator API for distributed training?
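Judging from the error, the failing RestoreV2 op is placed on /job:ps and looks for the module files under /tmp/tfhub_modules, a directory that is local to whichever machine downloaded the module. One possible workaround, offered as a sketch rather than a confirmed fix, is to point the TF Hub cache at storage that every replica can read. The sketch assumes the installed tensorflow_hub version honors the TFHUB_CACHE_DIR environment variable. The bucket path and the helper name configure_hub_cache are hypothetical.

```python
import os

# Hypothetical bucket path: replace with a location that the master,
# workers, and parameter servers can all read and write.
SHARED_CACHE_DIR = "gs://my-bucket/tfhub_cache"


def configure_hub_cache(cache_dir=SHARED_CACHE_DIR):
    """Point the TF Hub module cache at shared storage.

    Assumption: tensorflow_hub reads TFHUB_CACHE_DIR when it resolves a
    module URL, so this has to run before hub.Module(...) is constructed.
    """
    os.environ["TFHUB_CACHE_DIR"] = cache_dir
    return os.environ["TFHUB_CACHE_DIR"]


# Call once at program start, on every replica, before building the model_fn graph.
configure_hub_cache()
```

With the cache on shared storage, the checkpoint path baked into the module's variable initializers would refer to the same files on every task, instead of a /tmp directory that exists only on the machine that downloaded the module.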