Add tfio.IOTensor.from_csv support (experimental with Apache Arrow's CSV parser) #443
CSV is one of the most widely used file formats, and TensorFlow's main repo already has a CsvDataset that can conveniently be used for iteration, either fed into tf.keras directly or accessed through a for loop.
There are still reasons to have a CSV input processor that provides indexing and slicing access. The most notable is that, while a CSV file is technically only splittable (not truly indexable), in practice, especially in data science, a CSV file is almost always loaded into memory by default. And because CSV is so widely used, a CSV processor that allows indexing and slicing is more convenient and more flexible.
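To illustrate the difference between iteration and indexed access, here is a minimal sketch using only Python's standard library. It is not the implementation in this PR (which builds on Arrow's CSV parser); the sample data and the `load_columns` helper are hypothetical and exist only for illustration. The point is that once the file is fully in memory as columns, position-based indexing and slicing become trivial.

```python
import csv
import io

# Hypothetical sample data standing in for a CSV file on disk.
SAMPLE = "a,b\n1,0.5\n2,1.5\n3,2.5\n4,3.5\n"


def load_columns(text):
    """Read the whole CSV into memory as a dict of column lists.

    Illustrative helper only; not part of tfio or Arrow.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


columns = load_columns(SAMPLE)

# Values come back as strings, so convert each column to its intended type.
a = [int(v) for v in columns["a"]]
b = [float(v) for v in columns["b"]]

third = a[2]     # random access by position
middle = b[1:3]  # slicing a contiguous range of rows
```

A streaming reader would have to scan from the start of the file to reach row 2, whereas the in-memory columns support direct access at any position.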
This PR takes the indexing and slicing approach and builds the parser on top of Arrow. One advantage of Arrow is that its CSV parser options are closer to those of the widely used pandas, which makes it easy to import CSV files created by pandas.
Signed-off-by: Yong Tang [email protected]