Add tfio.IOTensor.from_avro support #440

yongtang · 2019-08-23T15:50:46Z

Avro is a row-oriented file format whose records map naturally onto table/column data. An Avro file is not directly indexable. However, it is pseudo-indexable, because it consists of blocks, each of which specifies its file offset/size and its item count. Indexing can therefore be done by iterating over a small range.

Making Avro indexable is desirable because it is more convenient and more flexible.

This PR adds tfio.IOTensor.from_avro support so that Avro data can be accessed through natural __getitem__ operations.
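The block-based indexing described above can be sketched in plain Python. This is only an illustration of the idea, not the code in this PR: `BlockIndex`, `BlockedSequence`, and the `read_block` callback are hypothetical names, and the per-block record counts are assumed to have been collected already by scanning the block headers.

```python
import bisect
from itertools import accumulate


class BlockIndex:
    """Maps a global record index to (block number, offset within block)."""

    def __init__(self, block_counts):
        # block_counts[i] is the number of records in block i (assumed known).
        # ends[i] is the global index one past the last record of block i.
        self.ends = list(accumulate(block_counts))

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def locate(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        # First block whose end lies strictly beyond the requested index.
        block = bisect.bisect_right(self.ends, index)
        start = self.ends[block - 1] if block else 0
        return block, index - start


class BlockedSequence:
    """Gives __getitem__ access on top of a block-at-a-time reader."""

    def __init__(self, block_counts, read_block):
        self.index = BlockIndex(block_counts)
        # Hypothetical callback: block number -> list of decoded records.
        self.read_block = read_block

    def __len__(self):
        return len(self.index)

    def __getitem__(self, key):
        if isinstance(key, slice):
            return [self[i] for i in range(*key.indices(len(self)))]
        block, offset = self.index.locate(key)
        # Only the block that holds the record is decoded.
        return self.read_block(block)[offset]


# In-memory lists stand in for decoded Avro blocks.
blocks = [[10, 11, 12], [13, 14], [15, 16, 17, 18]]
seq = BlockedSequence([len(b) for b in blocks], blocks.__getitem__)
```

A real reader would seek to the block's file offset and decode it there. The slice path above decodes a block once per element, so a practical version would cache the most recently read block.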

Signed-off-by: Yong Tang [email protected]

yongtang · 2019-08-28T04:28:03Z

@BryanCutler This PR starts with Avro, though I also added an additional commit on top of it that implements #445 (for the indexable case only). It queries a "capacity" function and obtains a tuple <start, stop>, where start and stop are 1-D tensors. With start and stop it is possible to predefine the batch size for each step when iterating over the dataset.

This could also be used for indexing and slicing, as we could use the batch size to obtain chunks for caching (saving one chunk of the tensor each time).
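As a rough sketch of the <start, stop> idea, the following computes the partition boundaries for a given capacity and shows how an index maps to a cached chunk. The function names are hypothetical, and plain Python lists stand in for the 1-D tensors mentioned above.

```python
def partitions(total, capacity):
    """Return (start, stop) lists that cover range(total) in steps of capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    start = list(range(0, total, capacity))
    # The last partition may be shorter than capacity.
    stop = [min(s + capacity, total) for s in start]
    return start, stop


def chunk_for(index, capacity):
    """Return (chunk number, offset inside the chunk) for a global index."""
    return divmod(index, capacity)


data = list(range(10))
start, stop = partitions(len(data), capacity=4)

# One slice per step; each slice is the unit that would be cached.
chunks = [data[s:e] for s, e in zip(start, stop)]
```

With equally sized partitions, looking up an element only requires the chunk that `chunk_for` points at, which is what makes chunk-level caching workable for indexing and slicing.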

If the data is streamed, some modification might be needed, because we cannot return a <start, stop> pair beforehand. However, we could still generate a dataset holding a sequence of capacities and pass each capacity on the fly (with map), so that each step still conforms to the batch size.
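The streaming case can be illustrated with plain Python iterators. This is a sketch under the assumption that the stream length is unknown up front; `batches` is a hypothetical helper, and a generator stands in for the dataset-plus-map pipeline described above.

```python
from itertools import islice, repeat


def batches(records, capacities):
    """Yield one batch per capacity value, pulling records lazily."""
    it = iter(records)
    for capacity in capacities:
        batch = list(islice(it, capacity))
        if not batch:
            # The stream is exhausted; no <start, stop> pair was ever needed.
            return
        yield batch


# A fixed capacity repeated forever plays the role of the capacity sequence.
fixed = list(batches(iter(range(7)), repeat(3)))

# The capacity may also change from step to step.
varying = list(batches(iter(range(7)), [2, 4, 4]))
```

Each step consumes at most `capacity` records, so the batch size is respected even though the end of the stream is only discovered when it is reached.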

BryanCutler

I don't know the details of the Avro format all that well, but this looks pretty good to me, @yongtang. One question I have with this, and with similar ops, is that the previous Dataset could take multiple filenames as input and iterate over all of them as one dataset. Is that going to be possible with AvroIOTensor.to_dataset()?

BryanCutler · 2019-08-29T22:58:01Z

tensorflow_io/core/python/ops/io_tensor.py

+  @classmethod
+  def from_avro(cls,
+                filename,
+                schema,


Can the schema be read from the file as part of the init op?
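For context, the Avro object container format stores the writer schema in the file header, under the `avro.schema` metadata key. The following pure-Python sketch reads it. It is not how the op in this PR works, and `read_header_schema` and its helpers are hypothetical names used only for illustration.

```python
import json

MAGIC = b"Obj\x01"


def read_long(stream):
    """Decode one Avro long (zigzag, variable-length encoding)."""
    shift = 0
    value = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated Avro long")
        value |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    # Undo the zigzag mapping to recover the signed value.
    return (value >> 1) ^ -(value & 1)


def read_bytes(stream):
    """Decode a length-prefixed byte string."""
    return stream.read(read_long(stream))


def read_header_schema(stream):
    """Return the schema stored in an Avro container header as a Python object."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break  # a zero count terminates the metadata map
        if count < 0:
            count = -count
            read_long(stream)  # block size in bytes, not needed here
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return json.loads(meta["avro.schema"])
```

Given a binary file object `f` opened on an Avro file, `read_header_schema(f)` would return the parsed schema without touching any data block.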


yongtang · 2019-09-19T22:44:08Z

@BryanCutler There is a need to update Avro, so I will just merge this PR for now. This PR also updates the way partitions are used to handle cached tensors in segments. This should open up the possibility of having multiple filenames (or repeats of the same dataset). I will create a follow-up PR to add support for multiple filenames.

* Add tfio.IOTensor.from_avro support

  Avro is a row-oriented file format whose records map naturally onto table/column data. An Avro file is not directly indexable. However, it is pseudo-indexable, because it consists of blocks, each of which specifies its file offset/size and its item count. Indexing can therefore be done by iterating over a small range.

  Making Avro indexable is desirable because it is more convenient and more flexible.

  This PR adds tfio.IOTensor.from_avro support so that Avro data can be accessed through natural __getitem__ operations.

  Signed-off-by: Yong Tang <[email protected]>

* Add a Partitions function to Avro, so that it is possible to dynamically adjust the capacity of the chunk size when reading.

  Signed-off-by: Yong Tang <[email protected]>

* Rename to io_stream.h for consistency

  Signed-off-by: Yong Tang <[email protected]>

* Remove the need to pass component, unless needed explicitly

  Signed-off-by: Yong Tang <[email protected]>

* Move Partitions to a generic location and support dataset

  Signed-off-by: Yong Tang <[email protected]>

yongtang force-pushed the io_tensor_avro branch 4 times, most recently from de56da6 to 8367d3e, August 28, 2019 01:59

yongtang requested a review from BryanCutler August 28, 2019 04:06

yongtang requested a review from terrytangyuan August 28, 2019 04:28

BryanCutler approved these changes Aug 29, 2019


yongtang added 4 commits September 19, 2019 14:15

Add a Partitions function to Avro, so that it is possible to dynamically adjust the capacity of the chunk size when reading.

ddde2b0

Signed-off-by: Yong Tang <[email protected]>

Rename to io_stream.h for consistency

f23347f

Signed-off-by: Yong Tang <[email protected]>

Remove the need to pass component, unless needed explicitly

20063ad

Signed-off-by: Yong Tang <[email protected]>

yongtang force-pushed the io_tensor_avro branch from 8367d3e to 20063ad, September 19, 2019 17:54

Move Partitions to a generic location and support dataset

5759a72

Signed-off-by: Yong Tang <[email protected]>

yongtang merged commit 2d25fc9 into tensorflow:master Sep 19, 2019

yongtang deleted the io_tensor_avro branch September 19, 2019 22:44
