@yongtang yongtang commented Aug 24, 2019

Feather is a columnar file format that is often used with pandas. This PR adds indexing and slicing support to bring Feather to parity with the Parquet file format, by adding `tfio.IOTensor.from_feather` so that Feather files can be accessed through natural `__getitem__` operations.

Signed-off-by: Yong Tang [email protected]

Note: this PR depends on PR 438.
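
To illustrate what indexing and slicing through `__getitem__` means for a columnar table, here is a minimal pure-Python sketch. `ColumnarTable` and `Column` are hypothetical stand-ins invented for this illustration, not the tfio API; the data lives in an in-memory dict rather than a Feather file.

```python
class Column:
    """Hypothetical view over one column of values."""

    def __init__(self, values):
        self._values = values

    def __len__(self):
        return len(self._values)

    def __getitem__(self, key):
        # Integer indexing and slicing are both delegated to the list.
        return self._values[key]


class ColumnarTable:
    """Hypothetical in-memory stand-in for a columnar file (not the tfio API)."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of values of equal length
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self._columns = columns

    def __call__(self, name):
        # Select a single column by name.
        return Column(self._columns[name])


table = ColumnarTable({"id": [1, 2, 3, 4], "score": [0.5, 0.25, 0.75, 1.0]})
scores = table("score")
first = scores[0]     # single element
middle = scores[1:3]  # slice of rows
```

The point of the sketch is only that a column behaves like a sequence: callers index or slice it without knowing how the rows are stored.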

@BryanCutler BryanCutler left a comment

This is great @yongtang! I was thinking, though, of working on a base IOTensor kernel that would operate on anything producing Arrow record batches. Then specific sources, like Feather, could just plug in and provide the record batches. If you wouldn't mind, I could try to incorporate Feather and some other sources in the future, wdyt?
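
The plug-in idea could look roughly like the sketch below. Everything here is an assumption made for illustration: `RecordBatchSource`, `InMemorySource`, and `BatchBackedColumn` are hypothetical names, and plain dicts of lists stand in for Arrow record batches. It is not the kernel design that was eventually implemented.

```python
from abc import ABC, abstractmethod


class RecordBatchSource(ABC):
    """Hypothetical plug-in interface: anything that can yield record batches."""

    @abstractmethod
    def batches(self):
        """Yield batches, each a dict of column name -> list of values."""


class InMemorySource(RecordBatchSource):
    """Stand-in for a format-specific source such as a Feather reader."""

    def __init__(self, batches):
        self._batches = batches

    def batches(self):
        yield from self._batches


class BatchBackedColumn:
    """Format-agnostic column view that relies only on the source interface."""

    def __init__(self, source, name):
        self._source = source
        self._name = name

    def __len__(self):
        return sum(len(batch[self._name]) for batch in self._source.batches())

    def __getitem__(self, key):
        # Simplest possible strategy: concatenate the column across all
        # batches, then index. A real kernel would avoid the full read.
        values = []
        for batch in self._source.batches():
            values.extend(batch[self._name])
        return values[key]


source = InMemorySource([{"x": [1, 2]}, {"x": [3, 4, 5]}])
column = BatchBackedColumn(source, "x")
```

With this split, supporting another format would mean writing only a new `RecordBatchSource` subclass; the indexing logic stays shared.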

const ::arrow::ipc::feather::fbs::CTable* table =
    ::arrow::ipc::feather::fbs::GetCTable(buffer.data());

if (table->version() < ::arrow::ipc::feather::kFeatherVersion) {
  return errors::InvalidArgument("feather file is old: ", table->version(),
                                 " vs. ", ::arrow::ipc::feather::kFeatherVersion);

This will error if the file is older than the feather version in use? Is that maybe too restrictive?
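
To make the question concrete, the sketch below contrasts the strict check from the diff with one possible relaxed variant. The function names, the `min_supported` parameter, and the version numbers used in the example are hypothetical; whether older Feather files can actually be read safely depends on the Arrow reader and is not established here.

```python
def check_strict(file_version, current_version):
    # Mirrors the check in the diff: any file written with a version older
    # than the one the library was built against is rejected.
    if file_version < current_version:
        raise ValueError(
            f"feather file is old: {file_version} vs. {current_version}"
        )


def check_with_minimum(file_version, min_supported):
    # Hypothetical relaxed variant: reject only files older than an
    # explicitly chosen minimum version.
    if file_version < min_supported:
        raise ValueError(
            f"feather file is too old: {file_version} vs. minimum {min_supported}"
        )


def is_accepted(check, *args):
    # Small helper for the example: True if the check does not raise.
    try:
        check(*args)
    except ValueError:
        return False
    return True


# Illustrative version numbers only.
strict_old = is_accepted(check_strict, 1, 2)         # rejected
relaxed_old = is_accepted(check_with_minimum, 1, 1)  # accepted
```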

@yongtang

@BryanCutler Yes, overall I would like to see more use of Arrow C++ where possible. A plugin that handles Arrow C++ the same way for different formats, especially columnar formats (with record batches), would be great.

Would love to see any code duplication reduction, especially C++ code.

Also, as mentioned in #445, it is actually possible to generate a sequence of batch sizes on the side, and then feed it back into map to generate a record iterator.
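
One reading of this idea, sketched in plain Python: first produce the sequence of row ranges from the total row count and a batch size, then map a reader over that sequence to obtain an iterator of batches. The built-in `map` stands in for a dataset `map`, and `batch_ranges` and `read_rows` are hypothetical helpers, not functions from tfio.

```python
def batch_ranges(total_rows, batch_size):
    """Yield (start, stop) row ranges that together cover total_rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_rows, batch_size):
        yield start, min(start + batch_size, total_rows)


def read_rows(column, start, stop):
    # Hypothetical reader: a real one would read only these rows from the
    # file; here it just slices a list.
    return column[start:stop]


column = list(range(10))
ranges = list(batch_ranges(len(column), 4))
batches = list(map(lambda r: read_rows(column, *r), ranges))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The last range is shorter than the others, which is why the ranges are generated up front instead of assuming a fixed size for every batch.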

Going even further, we could also use the record batch size to keep a cache (one batch at a time), even for `__getitem__` indexing.
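
A minimal sketch of such a one-batch cache follows, assuming fixed-size batches and integer indexing only. `CachedBatchColumn` and the `read_batch` callable are hypothetical; the `reads` counter exists only to make the caching visible.

```python
class CachedBatchColumn:
    """Hypothetical column that keeps only the most recently read batch."""

    def __init__(self, read_batch, total_rows, batch_size):
        self._read_batch = read_batch  # callable: batch index -> list of values
        self._total_rows = total_rows
        self._batch_size = batch_size
        self._cached_index = None
        self._cached_values = None
        self.reads = 0  # number of underlying batch reads, for illustration

    def __len__(self):
        return self._total_rows

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch supports integer indexing only")
        if index < 0:
            index += self._total_rows
        if not 0 <= index < self._total_rows:
            raise IndexError("row index out of range")
        batch_index, offset = divmod(index, self._batch_size)
        if batch_index != self._cached_index:
            # Cache miss: replace the cached batch with the one needed now.
            self._cached_values = self._read_batch(batch_index)
            self._cached_index = batch_index
            self.reads += 1
        return self._cached_values[offset]


data = list(range(10))
col = CachedBatchColumn(lambda i: data[i * 4:(i + 1) * 4], total_rows=10, batch_size=4)
same_batch = [col[0], col[1], col[3]]  # all served from one batch read
reads_after_first_batch = col.reads
next_batch_value = col[4]              # triggers a second read
```

Sequential or nearby lookups then cost one read per batch instead of one read per element, at the price of holding a single batch in memory.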

I think there are a lot of things and optimizations that could be done.

@BryanCutler

Thanks @yongtang, sounds good to me! Let's go ahead and merge this.

@BryanCutler BryanCutler merged commit 52acbf4 into tensorflow:master Aug 27, 2019
@yongtang yongtang deleted the feather_io_tensor branch August 27, 2019 18:50
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021