Add tfio.IOTensor.from_avro support #440
@BryanCutler This PR starts with Avro, though I also added an additional commit on top of it, which implements #445 (for indexable only). It queries a "capacity" function and obtains a tuple of [...] This could also be used for indexing and slicing, as we could use the batch size to obtain chunks for caching (saving a chunk of the tensor each time). If the data is passed as a stream, some modification might be needed. That is because we could not return a [...]
BryanCutler
left a comment
I don't know the details of the Avro format all that well, but this looks pretty good to me @yongtang. One question I have with this, and similar ops, is that the previous Dataset could take multiple filenames as input and iterate over all of them as one dataset. Is that going to be possible with AvroIOTensor.to_dataset()?
    @classmethod
    def from_avro(cls,
                  filename,
                  schema,
Can the schema be read from the file as part of the init op?
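As background for this question: an Avro object container file stores the writer schema in its header, under the `avro.schema` metadata key, so the schema can in principle be recovered from the file itself. The following is a minimal pure-Python sketch of that header parse, not the tfio implementation; the function names `read_long`, `read_bytes`, and `read_header_schema` are hypothetical and chosen only for illustration.

```python
import json

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one zigzag, variable-length Avro long from a binary stream."""
    shift, accum = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated Avro long")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    # Undo zigzag encoding: 0, 1, 2, 3 -> 0, -1, 1, -2.
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro `bytes` or `string`)."""
    return stream.read(read_long(stream))


def read_header_schema(stream):
    """Return the schema stored in the header metadata map, parsed as JSON."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a byte size follows, unused here
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return json.loads(meta["avro.schema"])
```

Calling `read_header_schema(open(path, "rb"))` on a container file would return the schema as a Python object. Whether the same lookup can run inside the init op is a separate design question that this sketch does not address.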
Avro is a columnar file format that naturally fits table/column data. An Avro file itself is not directly indexable. However, it is pseudo-indexable, as it consists of blocks, with each block specifying its file offset/size and item count. So indexing could be done by small range iteration.

It would be desirable to make Avro indexable, as it will be much more convenient and more flexible.

This PR adds tfio.IOTensor.from_avro support so that it is possible to access Avro data through natural __getitem__ operations.

Signed-off-by: Yong Tang <[email protected]>
Add a Partitions function to Avro, so that it is possible to dynamically adjust the capacity of the chunk size when reading. Signed-off-by: Yong Tang <[email protected]>
@BryanCutler There is a need to update Avro, so I will just merge this PR for now. This PR also updated the method of using partitions to handle cached tensors in segments. This should open up the possibility of having multiple filenames (or repeats of the same dataset). I will create a follow-up PR to add support for multiple filenames.
* Add tfio.IOTensor.from_avro support

  Avro is a columnar file format that naturally fits table/column data. An Avro file itself is not directly indexable. However, it is pseudo-indexable, as it consists of blocks, with each block specifying its file offset/size and item count. So indexing could be done by small range iteration.

  It would be desirable to make Avro indexable, as it will be much more convenient and more flexible.

  This PR adds tfio.IOTensor.from_avro support so that it is possible to access Avro data through natural __getitem__ operations.

  Signed-off-by: Yong Tang <[email protected]>

* Add a Partitions function to Avro, so that it is possible to dynamically adjust the capacity of the chunk size when reading.

  Signed-off-by: Yong Tang <[email protected]>

* Rename to io_stream.h for consistency

  Signed-off-by: Yong Tang <[email protected]>

* Remove the need to pass component, unless needed explicitly

  Signed-off-by: Yong Tang <[email protected]>

* Move Partitions to a generic location and support dataset

  Signed-off-by: Yong Tang <[email protected]>
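The block-based pseudo-indexing described in the commit message can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: `BlockIndexedReader`, `block_counts`, and `read_block` are hypothetical names, the per-block item counts are assumed to be known up front (for Avro they would come from the block headers), and only the most recently decoded block is cached.

```python
from bisect import bisect_right
from itertools import accumulate


class BlockIndexedReader:
    """Expose __getitem__ over records that are stored as sequential blocks."""

    def __init__(self, block_counts, read_block):
        # block_counts[i] is the number of items in block i.
        # read_block(i) is a hypothetical callback that decodes block i
        # and returns its items as a list.
        self._read_block = read_block
        # _starts[i] is the global index of the first item in block i;
        # the final entry is the total item count.
        self._starts = [0] + list(accumulate(block_counts))
        self._cached_id = None
        self._cached_items = None

    def __len__(self):
        return self._starts[-1]

    def _block(self, block_id):
        # Decode a block only when it differs from the cached one.
        if block_id != self._cached_id:
            self._cached_items = self._read_block(block_id)
            self._cached_id = block_id
        return self._cached_items

    def __getitem__(self, key):
        if isinstance(key, slice):
            return [self[i] for i in range(*key.indices(len(self)))]
        if key < 0:
            key += len(self)
        if not 0 <= key < len(self):
            raise IndexError(key)
        # Locate the block whose index range contains key, then offset into it.
        block_id = bisect_right(self._starts, key) - 1
        return self._block(block_id)[key - self._starts[block_id]]


# Usage with in-memory lists standing in for decoded Avro blocks.
blocks = [[10, 11, 12], [13, 14], [15, 16, 17, 18]]
reader = BlockIndexedReader([len(b) for b in blocks], lambda i: blocks[i])
single = reader[4]
window = reader[2:6]
```

Keeping a running total of block sizes turns each lookup into a binary search plus one block decode, which is the "small range iteration" idea in miniature. A real reader would seek to the block's file offset inside `read_block` instead of indexing a list.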