@yongtang yongtang commented Aug 23, 2019

Avro is a row-oriented file format whose records map naturally onto table/column data. An Avro file is not directly indexable. However, it is pseudo-indexable, because it consists of blocks, each specifying its file offset/size and item count. Indexing can therefore be done by iterating over a small range.
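To make the "pseudo-indexable" idea concrete, here is a minimal sketch in plain Python. It assumes the per-block item counts have already been collected by scanning the block headers; `build_block_starts` and `locate` are hypothetical helper names for illustration and are not part of this PR or of tensorflow-io.

```python
from bisect import bisect_right
from itertools import accumulate


def build_block_starts(block_counts):
    """Return the global index of the first record in each block.

    The returned list has one extra trailing entry holding the total
    number of records, so starts[i + 1] - starts[i] == block_counts[i].
    """
    return [0] + list(accumulate(block_counts))


def locate(starts, index):
    """Map a global record index to (block number, offset within block)."""
    total = starts[-1]
    if index < 0:
        index += total  # support negative indices like a Python list
    if not 0 <= index < total:
        raise IndexError("record index out of range")
    # The last block whose first record is <= index contains the record.
    block = bisect_right(starts, index) - 1
    return block, index - starts[block]


starts = build_block_starts([3, 2, 4])  # three blocks, nine records in total
block, offset = locate(starts, 4)       # second block, second record
```

Once the block is known, only that block has to be decoded and iterated for `offset` items, which is the "small range iteration" described above.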

Making Avro indexable is desirable, as it would be more convenient and more flexible.

This PR adds tfio.IOTensor.from_avro support so that it is possible to access Avro data through natural __getitem__ operations.

Signed-off-by: Yong Tang [email protected]

@yongtang yongtang force-pushed the io_tensor_avro branch 4 times, most recently from de56da6 to 8367d3e Compare August 28, 2019 01:59
@yongtang yongtang requested a review from BryanCutler August 28, 2019 04:06
@yongtang
Member Author

@BryanCutler This PR starts with Avro, though I also added a commit on top of it that implements #445 (for indexable inputs only). It queries a "capacity" function and obtains a <start, stop> tuple, where start and stop are 1-D tensors. With start and stop it is possible to predefine the batch size for each step when iterating the dataset.
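As a rough illustration of the <start, stop> idea, the sketch below computes the two sequences in plain Python. Lists stand in for the 1-D tensors, and `partitions` is a hypothetical name chosen here, not the op added in this PR.

```python
def partitions(total, capacity):
    """Split `total` records into consecutive ranges of at most `capacity`.

    Returns two parallel lists (starts, stops); range i covers the
    half-open interval [starts[i], stops[i]).
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    starts = list(range(0, total, capacity))
    stops = [min(start + capacity, total) for start in starts]
    return starts, stops


starts, stops = partitions(total=10, capacity=4)
for start, stop in zip(starts, stops):
    # Each step of the iteration would read exactly records [start, stop).
    print(start, stop)
```

Because every range is known up front, the number of steps and the size of each batch are fixed before any data is read.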

This could also be used for indexing and slicing, as we could use the batch size to obtain chunks for caching (saving one chunk of the tensor at a time).
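A small sketch of chunk-based caching for indexing, in plain Python. `ChunkCache` and the `read_range` callback are assumptions made for illustration; the actual implementation in this PR works on tensors and may differ. Only integer indexing is shown, and slicing would follow the same pattern across several chunks.

```python
class ChunkCache:
    """Serve single-record lookups from cached, fixed-capacity chunks."""

    def __init__(self, total, capacity, read_range):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._total = total
        self._capacity = capacity
        # Hypothetical callback: read_range(start, stop) -> list of records.
        self._read_range = read_range
        self._chunks = {}

    def __len__(self):
        return self._total

    def __getitem__(self, index):
        if index < 0:
            index += self._total
        if not 0 <= index < self._total:
            raise IndexError("record index out of range")
        chunk_id = index // self._capacity
        if chunk_id not in self._chunks:
            # Read the whole chunk once, then answer later lookups from memory.
            start = chunk_id * self._capacity
            stop = min(start + self._capacity, self._total)
            self._chunks[chunk_id] = self._read_range(start, stop)
        return self._chunks[chunk_id][index % self._capacity]


data = list(range(100, 110))
cache = ChunkCache(len(data), 4, lambda start, stop: data[start:stop])
value = cache[5]  # reads records [4, 8) once and caches them
```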

If data is passed as a stream, some modification might be needed, because we cannot return a <start, stop> pair beforehand. However, we could still generate a dataset of capacity values and pass the capacity on the fly (with map), so that each step still conforms to the batch size.
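The streaming case could look roughly like the sketch below. A plain Python generator stands in for the dataset-plus-map pipeline, and `stream_batches` is a hypothetical name, so treat this as an illustration of the idea rather than the eventual implementation.

```python
from itertools import islice, repeat


def stream_batches(records, capacities):
    """Yield batches from a stream, one batch per requested capacity.

    The total length of `records` is never needed: each step simply pulls
    up to `capacity` items, and iteration ends when the stream is exhausted.
    """
    iterator = iter(records)
    for capacity in capacities:
        batch = list(islice(iterator, capacity))
        if not batch:
            return
        yield batch


# A constant capacity sequence gives fixed-size steps (the last may be short).
batches = list(stream_batches(range(7), repeat(3)))
```

Passing a varying sequence of capacities instead of `repeat(3)` would adjust the chunk size from step to step without knowing <start, stop> in advance.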

@yongtang yongtang requested a review from terrytangyuan August 28, 2019 04:28

@BryanCutler BryanCutler left a comment

I don't know the details of the Avro format all that well, but this looks pretty good to me @yongtang. One question I have with this, and similar ops, is that the previous Dataset could take multiple filenames as input and iterate over all of them as one dataset. Is that going to be possible with AvroIOTensor.to_dataset()?

  @classmethod
  def from_avro(cls,
                filename,
                schema,

Can the schema be read from the file as part of the init op?

@yongtang
Member Author

@BryanCutler There is a need to update Avro, so I will merge this PR for now. This PR also updated the method of using partitions to handle cached tensors in segments, which should open up the possibility of supporting multiple filenames (or repeats of the same dataset). I will create a follow-up PR to add support for multiple filenames.

@yongtang yongtang merged commit 2d25fc9 into tensorflow:master Sep 19, 2019
@yongtang yongtang deleted the io_tensor_avro branch September 19, 2019 22:44
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
* Add tfio.IOTensor.from_avro support

Avro is a row-oriented file format whose records map naturally onto
table/column data. An Avro file is not directly indexable. However, it is
pseudo-indexable, because it consists of blocks, each specifying its file
offset/size and item count. Indexing can therefore be done by iterating
over a small range.

Making Avro indexable is desirable, as it would be more convenient and
more flexible.

This PR adds tfio.IOTensor.from_avro support so that it is possible to access
Avro data through natural __getitem__ operations.

Signed-off-by: Yong Tang <[email protected]>

* Add a Partitions function to Avro, so that it is possible to dynamically adjust the capacity of the chunk size when reading.

Signed-off-by: Yong Tang <[email protected]>

* Rename to io_stream.h for consistency

Signed-off-by: Yong Tang <[email protected]>

* Remove the need to pass component, unless needed explicitly

Signed-off-by: Yong Tang <[email protected]>

* Move Partitions to a generic location and support dataset

Signed-off-by: Yong Tang <[email protected]>