
Conversation

@yongtang
Member

@yongtang yongtang commented Apr 14, 2019

Next step is to expose attributes for variant

Signed-off-by: Yong Tang [email protected]

@yongtang
Member Author

/cc @suphoff

@yongtang
Member Author

Some additional items:

  • To expose image attributes stored in the variant in Python, ideally in a generic way so that no additional repeated C++ code is needed.
  • My understanding is that TIFF might have a different shape for each frame? Not captured in this PR yet.

@yongtang yongtang force-pushed the ImageInput branch 2 times, most recently from 61d2060 to 26032f6 on April 15, 2019 02:47
Next step is to expose attributes for variant

Signed-off-by: Yong Tang <[email protected]>
@yongtang yongtang requested a review from terrytangyuan April 16, 2019 03:46
@yongtang yongtang changed the title [WIP] Update Image to use variant for reference image objects Update Image to use variant for reference image objects Apr 16, 2019
@yongtang
Member Author

With PR #184 merged, this PR has been rebased and all tests passed. /cc @terrytangyuan to take a look as well.

"""

def __init__(self, filename):
"""Create an ImageReader.

ImageReader -> ImageDataset

Signed-off-by: Yong Tang <[email protected]>
@yongtang
Member Author

yongtang commented Apr 16, 2019

@suphoff Yes, the idea is about right. There are several additional notes here:

  1. A variant minimally contains the filename, archive filename, and compression filters, so that it can be opened independently without any prior state information. This will help in the future if we want to use multiple devices to open variants concurrently. /cc @terrytangyuan for this one as well.

  2. A variant could be extended to carry anything beyond the above, such as metadata. One example is in ImageInput, where we just store a format string that identifies the format of the image, e.g., tiff, webp, etc. This is just an early example; metadata could be extended to anything, e.g., encoding, shape, channels, or even high-level information such as pixel statistics, bounding boxes, etc. For very high-level information, I think we could take a layered approach, so that the base variant carries primitive metadata and another variant encapsulates the base variant plus the high-level metadata.

  3. The metadata extraction and filling of the variant is separated from the Dataset processing. Dataset processing needs the ReadRecord() API to wire up the Dataset flow, while metadata extraction happens in the FromStream API call. Any additional metadata could be filled into the variant during this call. Note that the filename, filters, and archive entry have already been embedded into the base variant, so any additional metadata will need to extend the variant (not start from index 0).

  4. Technically, the string type is not necessary. However, in Dataset's implementation (in tensorflow), a subgraph is created in C++ and added at runtime. This helps with graph optimization (to avoid going back and forth between C++ and Python). But the subgraph optimization is limited and has a hard time interpreting the variant data type.
    As a result, I added a workaround so the input will take either a string or a variant. Python passes a variant, but when the subgraph is created at runtime, the variant is serialized into a string and passed as the input. The node checks the input type: if the input is a string, it was passed by the subgraph optimization, so it is converted back to a variant.
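The dual string/variant input in point 4 can be illustrated with a minimal Python sketch. All names here (`FileVariant`, `dataset_input`) are hypothetical stand-ins for illustration; the real op is implemented in C++ and uses TensorFlow's variant encoding rather than pickle.

```python
import pickle


class FileVariant:
    """Toy stand-in for the C++ variant carrying a file reference."""

    def __init__(self, filename, metadata=None):
        self.filename = filename
        self.metadata = metadata or {}


def dataset_input(value):
    """Accept either a FileVariant or its serialized string/bytes form.

    A serialized input means the value was passed through the subgraph
    optimization, so it is decoded back into a variant first.
    """
    if isinstance(value, (str, bytes)):
        value = pickle.loads(value)  # convert back to variant
    return value


variant = FileVariant("sample.tiff", {"format": "tiff"})
serialized = pickle.dumps(variant)      # what the subgraph passes around
restored = dataset_input(serialized)    # node detects the string form
assert restored.filename == variant.filename
```

The key design point is that the consumer branches on the input type, so Python callers and the C++-generated subgraph can share one op signature.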

As we discussed, the idea is to store metadata, and this metadata is not necessarily related to Dataset. But we also want to make sure the graph creation fits into Dataset's graph implementation. I think variant meets both usages, and it provides ways to extend the metadata so that it could be used by other ops.
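Points 1 and 2 above could be sketched roughly as follows, with a base reference carrying just enough state (filename, archive, filters) to be opened independently, and richer metadata layered on top. The class and field names are hypothetical, not the actual C++ types:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseVariant:
    # Minimal state needed to reopen the stream without prior context.
    filename: str
    archive: str = ""       # archive filename, if the file lives inside one
    filters: tuple = ()     # compression filters, e.g. ("gzip",)


@dataclass(frozen=True)
class ImageVariant:
    # Layered approach: the base variant plus image-specific metadata.
    base: BaseVariant
    format: str = ""        # e.g. "tiff", "webp"


ref = ImageVariant(BaseVariant("scan.tif", filters=("gzip",)), format="tiff")
assert ref.base.filters == ("gzip",)
```

The layering keeps the base variant small and universal while letting each input type (image, text, audio) define its own wrapper.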

@yongtang
Member Author

@suphoff Adding the string type to fit into Dataset's internal graph optimization happened in 4019a0c

You can take a look at the Status AsGraphDefInternal() API in both tensorflow-io and tensorflow's Dataset. If we return an Unimplemented error in this API, the graph optimization will be disabled (so there is no need to convert back and forth with strings). However, a warning will be shown by Dataset in tensorflow's graph.

Status AsGraphDefInternal(SerializationContext* ctx,
                          DatasetGraphDefBuilder* b,
                          Node** node) const override {
  Node* input_node;
  Tensor input_tensor(DT_STRING, TensorShape({static_cast<int64>(input_.size())}));
  // GraphDefInternal has some trouble with Variant so use serialized string.
  for (size_t i = 0; i < input_.size(); i++) {
    string message;
    VariantTensorData serialized_data_f;
    VariantTensorDataProto serialized_proto_f;
    input_[i].Encode(&serialized_data_f);
    serialized_data_f.ToProto(&serialized_proto_f);
    EncodeVariant(serialized_proto_f, &message);
    input_tensor.flat<string>()(i) = message;
  }
  TF_RETURN_IF_ERROR(b->AddTensor(input_tensor, &input_node));
  TF_RETURN_IF_ERROR(b->AddDataset(this, {input_node}, node));
  return Status::OK();
}

@yongtang
Member Author

@suphoff One additional note about FromStream to parse metadata: if there is nothing to store as metadata, this function can simply do nothing. One example is the TextInput in another PR:
https://github.com/tensorflow/io/pull/189/files#diff-d2f0edac34eda608ddc740a2dc13a339

As text files are separated by lines and there is nothing else, the function above does not do anything.
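The contrast between the two cases might be sketched like this (hypothetical Python stand-ins for illustration, not the actual C++ FromStream API; the TIFF check only looks at the "II"/"MM" byte-order marks at the start of the file):

```python
import io


def text_from_stream(stream, metadata):
    # Text files are just lines; there is no metadata to record.
    pass


def image_from_stream(stream, metadata):
    # An image format can record whatever it learns from the header.
    header = stream.read(4)
    if header[:2] in (b"II", b"MM"):  # TIFF byte-order marks
        metadata["format"] = "tiff"


meta = {}
image_from_stream(io.BytesIO(b"II*\x00rest-of-file"), meta)
assert meta == {"format": "tiff"}
```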

@yongtang yongtang changed the title Update Image to use variant for reference image objects [WIP] Update Image to use variant for reference image objects Apr 18, 2019
@yongtang
Member Author

Let me change the PR to Work-in-Progress, so that we have room for additional thinking. There are several things we want to solve here:

  • When we distribute data across different hosts/devices, we want to avoid copying the data from one device to another host/device. We want to do the computation locally first, to reduce network traffic. That means we don't want to read file contents into a Tensor eagerly.
  • The idea is to have a reference (e.g., a Variant) represent a file object, so that the content can be read locally.

A Variant Tensor is immutable, so maybe we could have a pass-through operation to explicitly read data, e.g.,

input = image_input(filename)
resolved_input = image_resolve(input)

We can pass through the content (do nothing) if the input has already been resolved, e.g.,

resolved_input = image_resolve(input)
resolved_input = image_resolve(resolved_input) # pass through
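The pass-through behavior could look like the following sketch. Everything here (`ImageRef`, `ResolvedImage`, the in-memory `FILES` dict standing in for real file I/O) is hypothetical, for illustration only:

```python
FILES = {"sample.png": b"\x89PNG-bytes"}  # in-memory stand-in for real files


class ImageRef:
    """Unresolved reference: carries only the filename."""

    def __init__(self, filename):
        self.filename = filename


class ResolvedImage:
    """Content that has already been read."""

    def __init__(self, data):
        self.data = data


def image_resolve(value):
    if isinstance(value, ResolvedImage):
        return value                                 # already resolved: pass through
    return ResolvedImage(FILES[value.filename])      # explicit read happens here


resolved = image_resolve(ImageRef("sample.png"))
assert image_resolve(resolved) is resolved  # second call passes through
```

Making `image_resolve` idempotent means callers never need to track whether a value has been resolved yet.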

@suphoff
Contributor

suphoff commented Apr 20, 2019

@yongtang : Variants may be immutable, but you can wrap a pointer to a reference-counted C++ object into one, just like the Dataset implementation does.

As for distribution across host/device I see two scenarios

  1. GPU or similar devices -> a C++ object works, as only the buffer location is different.
  2. Remote host -> per-host pipeline (I believe I spied the beginning of this scheme (currently disabled) for file Datasets in tf.distribute; not 100% sure, this is still on my reading list).

I just don't see a lot of use in enumerating files matched by a wildcard on one host and sending the filenames to a second host, but I admit I could be totally wrong here.

A second issue I have not investigated is graph save/restore with reference-counted objects wrapped in variant tensors (I just haven't looked into graph save/restore yet).

Happy to discuss in a VC or on gitter.

@yongtang
Member Author

@suphoff It may not be a big issue for images, as images are likely small in size. But for other data formats the data could be huge, like GBs of data (where only a part or a small chunk of the data is truly used). In that situation, serializing and passing GB-sized tensors around from one host is not efficient. A reference by filename/entry will help distribute data that are not dense.

The Dataset in tensorflow was designed as an iterable or iterator, so it is not well suited for distributed systems. The distribute strategy helps to an extent, but it still only applies to dense datasets where every byte will be used, not to other formats where only a part or small chunks of the data are needed.

@suphoff
Contributor

suphoff commented Apr 22, 2019

@yongtang : I agree serializing GBs of data would not be a good solution.
A common distribution strategy for reading data with TFRecords seems to be to just use multiple files and distribute those files mutually exclusively across hosts. This strategy obviously does not work when using a single monolithic archive that contains the desired data streams. However, in a distributed environment each host could read the same archive and then extract a per-host, mutually exclusive set of streams from the archive.
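That per-host extraction could be sketched like this, assuming each host knows its index and the total host count (`streams_for_host` is a hypothetical name for illustration):

```python
def streams_for_host(all_streams, host_id, num_hosts):
    """Every host scans the same archive listing but keeps only a
    mutually exclusive subset of the streams (round-robin by index)."""
    return [s for i, s in enumerate(all_streams) if i % num_hosts == host_id]


streams = ["a.tif", "b.tif", "c.tif", "d.tif", "e.tif"]
assert streams_for_host(streams, 0, 2) == ["a.tif", "c.tif", "e.tif"]
assert streams_for_host(streams, 1, 2) == ["b.tif", "d.tif"]
```

Each host pays the cost of listing the archive, but no stream content is ever sent between hosts.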

@yongtang
Member Author

@suphoff Eventually, if each file (or archived object) is still too large, it is possible to split the file or object so that each host is only responsible for a chunk of the data: host one processes file_1:0-10000, host two processes file_1:10000-20000, and so on. The so-called "splittable file" used to be common in the big data ecosystem. We could implement splittable file support if there is a need, though nowadays I see more scenarios with a large collection of smaller files than a small collection of large files.
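The byte-range splitting above can be sketched in a few lines (`split_ranges` is a hypothetical helper for illustration; real splittable-file support would also need record-boundary alignment, which this ignores):

```python
def split_ranges(file_size, chunk_size):
    """Partition a file into (start, end) byte ranges, one per chunk.
    The last range may be shorter than chunk_size."""
    return [(start, min(start + chunk_size, file_size))
            for start in range(0, file_size, chunk_size)]


# file_1 of 25000 bytes split into 10000-byte chunks for three hosts:
assert split_ranges(25000, 10000) == [(0, 10000), (10000, 20000), (20000, 25000)]
```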
