@yongtang commented Aug 24, 2019

CSV is one of the most widely used file formats, and TensorFlow's main repo already provides a CsvDataset that can conveniently be used for iteration, either fed into tf.keras directly or accessed through a for loop.

There are still reasons to have a CSV input processor that gives indexing and slicing access. The most notable is that, while a CSV file is technically only splittable (not truly indexable), in practice, especially in data science, a CSV file is almost always loaded into memory by default. Given how widely CSV is used, a CSV processor that allows indexing and slicing is more convenient and more flexible.
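To make the distinction concrete, here is a minimal sketch using only Python's standard library. It is not the implementation in this PR: `CsvColumns`, its `column` method, and the `SAMPLE` data are hypothetical names chosen for illustration. It contrasts iteration-style access (rows visited once, in order) with index and slice access after the whole file has been loaded into memory.

```python
import csv
import io


class CsvColumns:
    """Hypothetical helper: parse a whole CSV into memory, column by column."""

    def __init__(self, text):
        reader = csv.reader(io.StringIO(text))
        header = next(reader)
        rows = list(reader)  # the whole file is materialized here
        # Store each column as a list so it supports integer indexing and slicing.
        self._columns = {
            name: [row[i] for row in rows] for i, name in enumerate(header)
        }

    def column(self, name):
        return self._columns[name]


# Made-up sample data, for illustration only.
SAMPLE = "id,score\n1,0.5\n2,0.7\n3,0.9\n4,0.1\n"

# Iteration-style access: rows are visited once, in order.
streamed_ids = [row["id"] for row in csv.DictReader(io.StringIO(SAMPLE))]

# Index/slice-style access: any position can be read after loading.
table = CsvColumns(SAMPLE)
third_score = table.column("score")[2]
middle_ids = table.column("id")[1:3]
```

The trade-off in this sketch is memory: every column is held in full, which is acceptable under the assumption stated above that CSV files are usually loaded into memory anyway.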

This PR takes the indexing and slicing approach and builds the parser on top of Arrow. One advantage of Arrow is that its CSV parser options are close to those of the widely used pandas, which makes it easy to import CSV files created by pandas.

Signed-off-by: Yong Tang [email protected]

@yongtang force-pushed the csv_io_tensor branch 2 times, most recently from f4ea342 to 6cbca6c (Compare) August 25, 2019 19:41
@yongtang

Plan to merge this PR soon as well, as it largely follows existing patterns. Any issues that surface can be addressed in follow-up PRs.

@yongtang merged commit 5470b5d into tensorflow:master Sep 11, 2019
@yongtang deleted the csv_io_tensor branch September 11, 2019 20:37
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
…CSV parser) (tensorflow#443)

* Add tfio.IOTensor.from_csv support (experimental with Apache Arrow's CSV parser)

* Fix python 3 error

Signed-off-by: Yong Tang <[email protected]>