
Conversation

@yongtang
Member

@yongtang yongtang commented Apr 14, 2019

Next step is to expose attributes for variant

Signed-off-by: Yong Tang [email protected]

@yongtang
Member Author

/cc @suphoff

@yongtang
Member Author

Some additional items:

  • To expose image attributes stored in the variant in Python, ideally in a generic way so that no additional repeated C++ code is needed.
  • My understanding is that TIFF might have a different shape for each frame? Not captured in this PR yet.

@yongtang yongtang force-pushed the ImageInput branch 2 times, most recently from 61d2060 to 26032f6 on April 15, 2019 02:47
Next step is to expose attributes for variant

Signed-off-by: Yong Tang <[email protected]>
@yongtang yongtang requested a review from terrytangyuan April 16, 2019 03:46
@yongtang yongtang changed the title [WIP] Update Image to use variant for reference image objects Update Image to use variant for reference image objects Apr 16, 2019
@yongtang
Member Author

With PR #184 merged, this PR has been rebased and all tests passed. /cc @terrytangyuan to take a look as well.

"""

def __init__(self, filename):
"""Create an ImageReader.

ImageReader -> ImageDataset

Signed-off-by: Yong Tang <[email protected]>
@yongtang
Member Author

yongtang commented Apr 16, 2019

@suphoff Yes, the idea is about right. There are several additional notes here:

  1. A variant minimally contains the filename, archive filename, and compression filters, so that it can be opened independently without any prior state information. This will help in the future if we want to use multiple devices to open variants concurrently. /cc @terrytangyuan for this one as well.

  2. A variant could be extended to carry anything beyond the above, such as metadata. One example is in ImageInput, where we just store a format string that identifies the format of the image, e.g., tiff, webp, etc. This is just an early example; metadata could be extended to anything, e.g., encoding, shape, channels, or even high-level information such as pixel statistics, bounding boxes, etc. For very high-level information, I think we could take a layered approach, so that the base variant carries primitive metadata and another variant encapsulates the base variant plus the high-level metadata.

  3. The metadata extraction and filling of the variant is separated from the Dataset processing. Dataset processing needs the ReadRecord() API to wire up the Dataset flow, while metadata extraction happens in the FromStream API call. Any additional metadata could be filled into the variant during this call. Note that the filename, filters, and archive entry have already been embedded into the base variant, so any additional metadata will need to extend the variant (not start from index 0).

  4. Technically, the string type is not necessary. However, in Dataset's implementation (in tensorflow), a subgraph is created in C++ and added at runtime. This helps with graph optimization (to avoid going back and forth between C++ and Python). But the subgraph optimization is limited and has a hard time interpreting the variant data type.
    As a result, I added a workaround so the input will take either a string or a variant. Python passes a variant, but when the subgraph is created at runtime, the variant is serialized into a string and passed as the input. The node checks the input type: if the input is a string, it was passed by the subgraph optimization, so it is converted back to a variant.
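The dual string/variant input in point 4 can be illustrated with a minimal Python sketch. All names here (`FileVariant`, `dataset_input`) are hypothetical stand-ins for illustration; the real op is implemented in C++ and uses TensorFlow's variant encoding rather than pickle.

```python
import pickle


class FileVariant:
    """Toy stand-in for the C++ variant carrying a file reference."""

    def __init__(self, filename, metadata=None):
        self.filename = filename
        self.metadata = metadata or {}


def dataset_input(value):
    """Accept either a FileVariant or its serialized string/bytes form.

    A serialized input means the value was passed through the subgraph
    optimization, so it is decoded back into a variant first.
    """
    if isinstance(value, (str, bytes)):
        value = pickle.loads(value)  # convert back to variant
    return value


variant = FileVariant("sample.tiff", {"format": "tiff"})
serialized = pickle.dumps(variant)      # what the subgraph passes around
restored = dataset_input(serialized)    # node detects the string form
assert restored.filename == variant.filename
```

The key design point is that the consumer branches on the input type, so Python callers and the C++-generated subgraph can share one op signature.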

As we discussed, the idea is to store metadata, and this metadata is not necessarily related to Dataset. But we also want to make sure the graph creation fits into Dataset's graph implementation. I think variant meets both usages, and it provides ways to extend the metadata so that it could be used by other ops.
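Points 1 and 2 above could be sketched roughly as follows, with a base reference carrying just enough state (filename, archive, filters) to be opened independently, and richer metadata layered on top. The class and field names are hypothetical, not the actual C++ types:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseVariant:
    # Minimal state needed to reopen the stream without prior context.
    filename: str
    archive: str = ""       # archive filename, if the file lives inside one
    filters: tuple = ()     # compression filters, e.g. ("gzip",)


@dataclass(frozen=True)
class ImageVariant:
    # Layered approach: the base variant plus image-specific metadata.
    base: BaseVariant
    format: str = ""        # e.g. "tiff", "webp"


ref = ImageVariant(BaseVariant("scan.tif", filters=("gzip",)), format="tiff")
assert ref.base.filters == ("gzip",)
```

The layering keeps the base variant small and universal while letting each input type (image, text, audio) define its own wrapper.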

@yongtang
Member Author

@suphoff Adding the string type to fit into Dataset's internal graph optimization happened in 4019a0c

You can take a look at the Status AsGraphDefInternal() API in both tensorflow-io and tensorflow's Dataset. If we return an Unimplemented error in this API, the graph optimization will be disabled (so there is no need to convert back and forth with strings). However, a warning will be shown by Dataset in tensorflow's graph.

Status AsGraphDefInternal(SerializationContext* ctx,
                          DatasetGraphDefBuilder* b,
                          Node** node) const override {
  Node* input_node;
  Tensor input_tensor(DT_STRING, TensorShape({static_cast<int64>(input_.size())}));
  // GraphDefInternal has some trouble with Variant so use serialized string.
  for (size_t i = 0; i < input_.size(); i++) {
    string message;
    VariantTensorData serialized_data_f;
    VariantTensorDataProto serialized_proto_f;
    input_[i].Encode(&serialized_data_f);
    serialized_data_f.ToProto(&serialized_proto_f);
    EncodeVariant(serialized_proto_f, &message);
    input_tensor.flat<string>()(i) = message;
  }
  TF_RETURN_IF_ERROR(b->AddTensor(input_tensor, &input_node));
  TF_RETURN_IF_ERROR(b->AddDataset(this, {input_node}, node));
  return Status::OK();
}

@yongtang
Member Author

@suphoff One additional note about FromStream to parse metadata: if there is nothing to store as metadata, this function can simply do nothing. One example is the TextInput in another PR:
https://github.com/tensorflow/io/pull/189/files#diff-d2f0edac34eda608ddc740a2dc13a339

As text files are separated by lines and there is nothing else, the function above does not do anything.
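The contrast between the two cases might be sketched like this (hypothetical Python stand-ins for illustration, not the actual C++ FromStream API; the TIFF check only looks at the "II"/"MM" byte-order marks at the start of the file):

```python
import io


def text_from_stream(stream, metadata):
    # Text files are just lines; there is no metadata to record.
    pass


def image_from_stream(stream, metadata):
    # An image format can record whatever it learns from the header.
    header = stream.read(4)
    if header[:2] in (b"II", b"MM"):  # TIFF byte-order marks
        metadata["format"] = "tiff"


meta = {}
image_from_stream(io.BytesIO(b"II*\x00rest-of-file"), meta)
assert meta == {"format": "tiff"}
```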

@yongtang yongtang changed the title Update Image to use variant for reference image objects [WIP] Update Image to use variant for reference image objects Apr 18, 2019
@yongtang
Member Author

Let me change the PR to Work-in-Progress, so that we have room for additional thinking. There are several things we want to solve here:

  • When we distribute data across different hosts/devices, we want to avoid copying the data from one device to another host/device. We want to do the computation locally first, to reduce network traffic. That means we don't want to read file contents into a Tensor eagerly.
  • The idea is to have a reference (e.g., a Variant) represent a file object, so that the content can be read locally.

A Variant Tensor is immutable, so maybe we could have a pass-through operation to explicitly read data, e.g.,

input = image_input(filename)
resolved_input = image_resolve(input)

We can pass through the content (do nothing) if the input has already been resolved, e.g.,

resolved_input = image_resolve(input)
resolved_input = image_resolve(resolved_input) # pass through
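The pass-through behavior could look like the following sketch. Everything here (`ImageRef`, `ResolvedImage`, the in-memory `FILES` dict standing in for real file I/O) is hypothetical, for illustration only:

```python
FILES = {"sample.png": b"\x89PNG-bytes"}  # in-memory stand-in for real files


class ImageRef:
    """Unresolved reference: carries only the filename."""

    def __init__(self, filename):
        self.filename = filename


class ResolvedImage:
    """Content that has already been read."""

    def __init__(self, data):
        self.data = data


def image_resolve(value):
    if isinstance(value, ResolvedImage):
        return value                                 # already resolved: pass through
    return ResolvedImage(FILES[value.filename])      # explicit read happens here


resolved = image_resolve(ImageRef("sample.png"))
assert image_resolve(resolved) is resolved  # second call passes through
```

Making `image_resolve` idempotent means callers never need to track whether a value has been resolved yet.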

@suphoff
Contributor

suphoff commented Apr 20, 2019

@yongtang : Variants may be immutable, but you can wrap a pointer to a reference-counted C++ object into one, just like the Dataset implementation does.

As for distribution across host/device I see two scenarios

  1. GPU or similar devices -> a C++ object works, as only the buffer location is different.
  2. Remote host -> per-host pipeline (I believe I spied the beginning of this scheme (currently disabled) for file Datasets in tf.distribute; not 100% sure, this is still on my reading list).

I just don't see a lot of use in enumerating files matched by a wildcard on one host and sending the filenames to a second host, but I admit I could be totally wrong here.

A second issue I have not investigated is graph save/restore with reference-counted objects wrapped in variant tensors (I just haven't looked into graph save/restore yet).

Happy to discuss in a VC or on gitter.

@yongtang
Member Author

@suphoff It may not be a big issue for images, as images are likely small in size. But for other data formats the data could be huge, like GBs of data (where only a part or a small chunk of the data is truly used). In that situation, serializing and passing GB-sized tensors around from one host is not efficient. A reference by filename/entry will help distribute data that are not dense.

The Dataset in tensorflow was designed as an iterable or iterator, so it is not well suited for distributed systems. The distribute strategy helps to an extent, but it still only applies to dense datasets where every byte will be used, not to other formats where only a part or small chunks of the data are needed.

@suphoff
Contributor

suphoff commented Apr 22, 2019

@yongtang : I agree serializing GBs of data would not be a good solution.
A common distribution strategy for reading data with TFRecords seems to be to just use multiple files and distribute those files mutually exclusively across hosts. This strategy obviously does not work when using a single monolithic archive that contains the desired data streams. However, in a distributed environment each host could read the same archive and then extract a per-host, mutually exclusive set of streams from the archive.
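That per-host extraction could be sketched like this, assuming each host knows its index and the total host count (`streams_for_host` is a hypothetical name for illustration):

```python
def streams_for_host(all_streams, host_id, num_hosts):
    """Every host scans the same archive listing but keeps only a
    mutually exclusive subset of the streams (round-robin by index)."""
    return [s for i, s in enumerate(all_streams) if i % num_hosts == host_id]


streams = ["a.tif", "b.tif", "c.tif", "d.tif", "e.tif"]
assert streams_for_host(streams, 0, 2) == ["a.tif", "c.tif", "e.tif"]
assert streams_for_host(streams, 1, 2) == ["b.tif", "d.tif"]
```

Each host pays the cost of listing the archive, but no stream content is ever sent between hosts.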

@yongtang
Member Author

@suphoff Eventually, if each file (or archived object) is still too large, it is possible to split the file or object so that each host is only responsible for a chunk of the data: host one processes file_1:0-10000, host two processes file_1:10000-20000, and so on. The so-called "splittable file" used to be common in the big data ecosystem. We could implement splittable file support if there is a need, though nowadays I see more scenarios with a large collection of smaller files than a small collection of large files.
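The byte-range splitting above can be sketched in a few lines (`split_ranges` is a hypothetical helper for illustration; real splittable-file support would also need record-boundary alignment, which this ignores):

```python
def split_ranges(file_size, chunk_size):
    """Partition a file into (start, end) byte ranges, one per chunk.
    The last range may be shorter than chunk_size."""
    return [(start, min(start + chunk_size, file_size))
            for start in range(0, file_size, chunk_size)]


# file_1 of 25000 bytes split into 10000-byte chunks for three hosts:
assert split_ranges(25000, 10000) == [(0, 10000), (10000, 20000), (20000, 25000)]
```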
