Add tfio.IOTensor.from_feather support #442
Note: this PR depends on PR 438. Feather is a columnar file format that is often seen with pandas. This PR adds indexing and slicing support to bring Feather to parity with the Parquet file format, by adding tfio.IOTensor.from_feather support so that it is possible to access feather files through natural `__getitem__` operations. Signed-off-by: Yong Tang <[email protected]>
BryanCutler
left a comment
This is great @yongtang! I was thinking, though, of working on a base IOTensor kernel that would operate on anything producing Arrow record batches. Then specific sources, like feather, could just plug in and provide the record batches. If you wouldn't mind, I could try to incorporate feather and some other sources in the future, wdyt?
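To make the plug-in idea above concrete, here is a minimal Python sketch. It is not the proposed kernel and uses no Arrow or tfio APIs: `Batch`, `BatchSource`, `InMemorySource`, and `read_column` are hypothetical names, and a plain dict of lists stands in for an Arrow record batch.

```python
from typing import Dict, Iterator, List, Protocol

# Stand-in for an Arrow record batch: column name -> values (assumption).
Batch = Dict[str, List[int]]


class BatchSource(Protocol):
    """Anything that can produce record batches can plug in."""

    def batches(self) -> Iterator[Batch]:
        ...


class InMemorySource:
    """Hypothetical source; a feather or parquet reader would go here."""

    def __init__(self, batches: List[Batch]) -> None:
        self._batches = batches

    def batches(self) -> Iterator[Batch]:
        return iter(self._batches)


def read_column(source: BatchSource, column: str) -> List[int]:
    """Format-agnostic logic: it only ever sees batches, never the file format."""
    values: List[int] = []
    for batch in source.batches():
        values.extend(batch[column])
    return values


source = InMemorySource([{"a": [1, 2]}, {"a": [3]}])
print(read_column(source, "a"))
```

The point of the split is that `read_column` is written once, and each format only has to implement `batches()`.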
const ::arrow::ipc::feather::fbs::CTable* table = ::arrow::ipc::feather::fbs::GetCTable(buffer.data());

if (table->version() < ::arrow::ipc::feather::kFeatherVersion) {
  return errors::InvalidArgument("feather file is old: ", table->version(), " vs. ", ::arrow::ipc::feather::kFeatherVersion);
This will error if the file is older than the feather version in use? Is that maybe too restrictive?
@BryanCutler Yes, overall I would like to see more inclusion of Arrow C++ if possible. A plugin that handles Arrow C++ the same way for different formats, especially columnar formats (with record batches), would be great. I would love to see any reduction in code duplication, especially in C++ code. Also, as mentioned in #445, it is actually possible to generate a sequence of batch sizes on the side and then feed it back into map to generate a record iterator. Going further, we could also use the record batch size to keep a cache (one batch at a time), even for indexing. I think there are a lot of things and optimizations that could be done.
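As a rough illustration of the one-batch-at-a-time cache for indexing mentioned above, here is a pure-Python sketch. It assumes the batch sizes are known up front and that a batch can be loaded by its index; `BatchCachedColumn` and `load_batch` are hypothetical names and not part of tfio or Arrow.

```python
import bisect
from itertools import accumulate
from typing import Callable, List, Optional, Sequence


class BatchCachedColumn:
    """Index rows by global position while holding only one batch in memory."""

    def __init__(self, batch_sizes: Sequence[int],
                 load_batch: Callable[[int], List[int]]) -> None:
        # Cumulative end offsets, e.g. sizes [3, 2] -> ends [3, 5].
        self._ends = list(accumulate(batch_sizes))
        self._load_batch = load_batch  # hypothetical loader: batch index -> rows
        self._cached_index: Optional[int] = None
        self._cached_rows: List[int] = []
        self.loads = 0  # counts how often a batch was actually loaded

    def __len__(self) -> int:
        return self._ends[-1] if self._ends else 0

    def _row(self, i: int) -> int:
        # First batch whose end offset is strictly greater than i.
        b = bisect.bisect_right(self._ends, i)
        if b != self._cached_index:
            self._cached_rows = self._load_batch(b)
            self._cached_index = b
            self.loads += 1
        start = self._ends[b - 1] if b > 0 else 0
        return self._cached_rows[i - start]

    def __getitem__(self, key):
        if isinstance(key, slice):
            return [self._row(i) for i in range(*key.indices(len(self)))]
        if key < 0:
            key += len(self)
        if not 0 <= key < len(self):
            raise IndexError("row index out of range")
        return self._row(key)


batches = [[10, 11, 12], [13, 14]]
column = BatchCachedColumn([len(b) for b in batches], lambda b: batches[b])
print(column[1], column[2:5], column.loads)
```

Sequential access touches each batch once; random access that jumps between batches reloads on every jump, which is the trade-off of keeping a single-batch cache.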
Thanks @yongtang, sounds good to me! Let's go ahead and merge this.