Add tfio.IOTensor.from_csv support (experimental with Apache Arrow's CSV parser) #443

yongtang · 2019-08-24T03:18:30Z

CSV is one of the most widely used file formats, and TensorFlow's main repo already has a CsvDataset that can be conveniently used for iteration, either fed into tf.keras directly or accessed through a for loop.

There are still some reasons to have a CSV input processor that gives indexing and slicing access. The most notable is that, while a CSV file is technically only splittable (not truly indexable), in practice, especially in data science, a CSV file is almost always loaded into memory by default. And because CSV is so widely used, it is more convenient and more flexible to have a CSV processor that allows indexing and slicing.
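To make the distinction concrete, here is a minimal sketch of what indexing and slicing access means once a CSV file is fully loaded into memory. It uses only Python's standard `csv` module as a stand-in parser; it is not the Arrow-based implementation in this PR and not the `tfio.IOTensor` API. The helper `load_csv_columns` and the sample data are hypothetical and exist only for illustration.

```python
import csv
import io


def load_csv_columns(text):
    """Parse CSV text fully into memory as {column name: list of string values}.

    Hypothetical helper for illustration only; the PR itself uses Arrow's parser.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)                      # first row holds the column names
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)        # values stay as strings here
    return columns


# Made-up sample data, not taken from the PR.
SAMPLE = "id,score\n1,0.5\n2,0.7\n3,0.9\n4,0.1\n"

columns = load_csv_columns(SAMPLE)

# Iteration-only access (the CsvDataset style) would have to walk rows in order.
# With everything in memory, rows can be addressed directly instead:
third_score = columns["score"][2]   # random access by row index
middle_ids = columns["id"][1:3]     # a slice of rows
```

The trade-off is the one described above: the whole file has to fit in memory, in exchange for direct access to any row or range of rows.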

This PR takes the indexing and slicing approach and builds the parser on top of Arrow. One advantage of Arrow is that its CSV parser options are closer to those of the widely used pandas. This makes it easy to import CSV files created by pandas.

Signed-off-by: Yong Tang <[email protected]>

yongtang · 2019-09-11T18:32:14Z

I plan to merge this PR soon as well, since it largely follows existing patterns. Any issues that surface can be addressed in follow-up PRs.

yongtang force-pushed the csv_io_tensor branch 2 times, most recently from f4ea342 to 6cbca6c Compare August 25, 2019 19:41

yongtang force-pushed the csv_io_tensor branch from 6cbca6c to 8f51f3a Compare September 11, 2019 17:04

yongtang added 2 commits September 11, 2019 18:26

Fix python 3 error

4e9ec5f

Signed-off-by: Yong Tang <[email protected]>

yongtang force-pushed the csv_io_tensor branch from 40967b5 to 4e9ec5f Compare September 11, 2019 18:31

yongtang merged commit 5470b5d into tensorflow:master Sep 11, 2019

yongtang deleted the csv_io_tensor branch September 11, 2019 20:37
