Skip to content

How to use TF Hub on a distributed setting? #48

@cgarciae

Description

@cgarciae

I want to use the ResNet-101-v2 feature vectors to do some transfer learning. I am training with the Estimator API on GCP, I call the hub Module at the beggining of the model_fn.

module_url = "https://tfhub.dev/google/imagenet/resnet_v2_101/feature_vector/1"
module = hub.Module(module_url)
height, width = hub.get_expected_image_size(module)
images = tf.image.resize_images(input_tensor, [height, width])
feature_vectors = module(images)

When I run in a single node ("basic-gpu") all is well, however, when I run the same code in distributed mode ("standard-1") I get this error:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): [...] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 337, in _set_variable_or_list_initializer _set_checkpoint_initializer(variable_or_list, ckpt_file, tensor_name, "") File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 299, in _set_checkpoint_initializer ckpt_file, [tensor_name], [slice_spec], [base_type], name=name)[0] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1458, in restore_v2 shape_and_slices=shape_and_slices, dtypes=dtypes, name=name) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op op_def=op_def) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1654, in init self._traceback = self._graph._extract_stack() # pylint: disable=protected-access InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on /tmp/tfhub_modules/e0c607f95a3d67bc8928a5c20d09d1915322cfcb/variables/variables: Not found: /tmp/tfhub_modules/e0c607f95a3d67bc8928a5c20d09d1915322cfcb/variables; No such file or directory [[Node: checkpoint_initializer_537 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:1/device:CPU:0"](checkpoint_initializer_537/prefix, checkpoint_initializer_537/tensor_names, checkpoint_initializer_537/shape_and_slices)]] [[Node: init/NoOp_3_S22 = _Recvclient_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:1/device:CPU:0", send_device_incarnation=-7983147897712139617, tensor_name="edge_3296_init/NoOp_3", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]] To find out more about why your job exited please check the logs: ....

How should I structure my code for TF Hub to work with the Estimator API for distributed training?

Metadata

Metadata

Assignees

Labels

hubFor all issues related to tf hub library and tf hub tutorials or examples posted by hub teamsubtype:image-feature-vectortype:support

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions