Add tfio.IOTensor.from_feather support #442

yongtang · 2019-08-24T03:17:20Z

Feather is a columnar file format often seen with pandas. This PR adds indexing and slicing support to bring Feather to parity with the Parquet file format, by adding tfio.IOTensor.from_feather support so that Feather files can be accessed through natural `__getitem__` operations.

Signed-off-by: Yong Tang [email protected]

Note: this PR depends on PR 438.

BryanCutler

This is great @yongtang! I was thinking, though, of working on a base IOTensor kernel that would operate on anything producing Arrow record batches. Then specific sources, like Feather, could just plug in and provide the record batches. If you wouldn't mind, I could try to incorporate Feather and some other sources in the future, wdyt?

BryanCutler · 2019-08-26T21:23:06Z

tensorflow_io/arrow/kernels/arrow_kernels.cc

+    const ::arrow::ipc::feather::fbs::CTable* table = ::arrow::ipc::feather::fbs::GetCTable(buffer.data());
+
+    if (table->version() < ::arrow::ipc::feather::kFeatherVersion) {
+      return errors::InvalidArgument("feather file is old: ", table->version(), " vs. ", ::arrow::ipc::feather::kFeatherVersion);


This will error if the file is older than the feather version in use? Is that maybe too restrictive?

@BryanCutler This is aligned with arrow:
https://github.com/apache/arrow/blob/438a140142be423b1b2af2399567a0a8aeba9aa1/cpp/src/arrow/ipc/feather_internal.h#L129-L132
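
The check under discussion can be sketched in Python as follows. This is only an illustration of the comparison in the diff above, not the kernel code: `FEATHER_VERSION` and `check_feather_version` are hypothetical names, and the value 2 is a placeholder standing in for `::arrow::ipc::feather::kFeatherVersion`.

```python
# Placeholder for ::arrow::ipc::feather::kFeatherVersion (assumed value, for illustration only).
FEATHER_VERSION = 2


def check_feather_version(file_version: int, library_version: int = FEATHER_VERSION) -> None:
    """Reject files written with a Feather version older than the library's.

    Mirrors the `table->version() < kFeatherVersion` comparison in the diff:
    equal or newer versions pass, older ones raise.
    """
    if file_version < library_version:
        raise ValueError(
            f"feather file is old: {file_version} vs. {library_version}"
        )
```

Under this comparison a file at the library's own version, or a newer one, is accepted; only strictly older files are rejected, which is the restriction the review comment asks about.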

yongtang · 2019-08-27T00:14:41Z

@BryanCutler Yes, overall I would like to see more inclusion of Arrow C++ if possible. A plugin that handles Arrow C++ the same way for different formats, especially columnar formats (with record batches), would be great.

I would love to see any reduction in code duplication, especially in the C++ code.

Also, as mentioned in #445, it is actually possible to generate a sequence of batch sizes on the side, and then inject it back into map to generate a record iterator.

Going further, we could also use the record batch size to keep a cache (one batch at a time), even for `__getitem__` indexing.

I think there are a lot of things and optimizations that could be done.
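
The one-batch-at-a-time cache idea could look roughly like the sketch below. It is not code from this PR or from tensorflow-io: `BatchCachedColumn`, `load_batch`, and the `loads` counter are hypothetical names, and plain Python lists stand in for Arrow record batches.

```python
import bisect
from typing import Callable, List, Optional, Sequence


class BatchCachedColumn:
    """Hypothetical column view that keeps only one record batch in memory."""

    def __init__(self, batch_sizes: Sequence[int],
                 load_batch: Callable[[int], Sequence]) -> None:
        self._load_batch = load_batch
        # Starting row offset of every batch, derived from the batch sizes.
        self._starts: List[int] = []
        total = 0
        for size in batch_sizes:
            self._starts.append(total)
            total += size
        self._length = total
        self._cached_index: Optional[int] = None
        self._cached_rows: Sequence = []
        self.loads = 0  # number of batch loads, kept only to show cache hits

    def __len__(self) -> int:
        return self._length

    def _row(self, i: int):
        # Find the batch whose row range contains i.
        batch = bisect.bisect_right(self._starts, i) - 1
        if batch != self._cached_index:
            # Cache miss: replace the single cached batch.
            self._cached_rows = self._load_batch(batch)
            self._cached_index = batch
            self.loads += 1
        return self._cached_rows[i - self._starts[batch]]

    def __getitem__(self, key):
        if isinstance(key, slice):
            return [self._row(i) for i in range(*key.indices(self._length))]
        if key < 0:
            key += self._length
        if not 0 <= key < self._length:
            raise IndexError("row index out of range")
        return self._row(key)


# Usage with three in-memory "batches" of sizes 3, 2 and 4.
batches = [[10, 11, 12], [13, 14], [15, 16, 17, 18]]
column = BatchCachedColumn([len(b) for b in batches], lambda i: batches[i])
first_rows = column[1:4]  # spans batches 0 and 1, so two loads
```

Sequential or nearby accesses then hit the cached batch, while a jump to a different batch triggers exactly one reload.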

BryanCutler · 2019-08-27T17:57:53Z

Thanks @yongtang, sounds good to me! Let's go ahead and merge this.


yongtang force-pushed the feather_io_tensor branch from a504bf0 to 497290a Compare August 25, 2019 19:03

yongtang force-pushed the feather_io_tensor branch from 497290a to d55a9b5 Compare August 25, 2019 19:39

yongtang requested review from BryanCutler and terrytangyuan August 26, 2019 16:46

BryanCutler reviewed Aug 26, 2019


BryanCutler merged commit 52acbf4 into tensorflow:master Aug 27, 2019

yongtang deleted the feather_io_tensor branch August 27, 2019 18:50
