
Conversation

@yongtang (Member) commented Jul 30, 2019

This PR is part of the effort to rework Dataset so that large files are read into Tensors first to speed up performance. See #382 and #366 for related discussions.

Summary:

  1. read_text is able to read a text file within the byte range [offset, offset+length]
  2. that makes the text file splittable, so the file can be read in chunks, similar to Hadoop (see the sketch after this list)
  3. the plan is to read a text file in large chunks and then wire that up with tf.data.Dataset
  4. read_text is a primitive C++ op, so it can be used in tf.data as well as in other places.
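
As a rough sketch of the intended split semantics (this is only an illustration, not the actual kernel implementation; the Hadoop-style line-boundary handling below is an assumption), a pure-Python stand-in could look like:

```python
# Pure-Python stand-in for the new op, for illustration only. It assumes
# Hadoop-style split semantics: every split except the first skips the
# (possibly partial) line it lands in, and each split keeps reading past
# its end to finish the last line it started, so every line is produced
# by exactly one split.
def read_text(filename, offset, length):
    lines = []
    with open(filename, "rb") as f:
        f.seek(offset)
        if offset > 0:
            f.readline()  # this line belongs to the previous split
        while f.tell() <= offset + length:
            line = f.readline()
            if not line:  # end of file
                break
            lines.append(line.rstrip(b"\r\n").decode("utf-8"))
    return lines
```

With that, a large file can be described as a list of (offset, length) splits and each split can be read independently, which is what makes the reads easy to parallelize.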

Signed-off-by: Yong Tang [email protected]

yongtang added 3 commits July 31, 2019 23:53
This PR is part of the effort to rework Dataset so that
large files are read into Tensors first to speed up performance.
See #382 and #366 for related discussions.

Summary:
1) read_text is able to read a text file within the byte range [offset, offset+length]
2) that makes the text file splittable, so the file can be read in chunks (similar to Hadoop)
3) the plan is to read a text file in large chunks and then wire that up with tf.data.Dataset
4) read_text is a primitive C++ op, so it can be used in tf.data as well as in other places.

Note: once PR 393 is merged I will convert TextDataset to use this op
(and remove the native C++ implementation of TextDataset).

Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
@yongtang (Member, Author) commented Aug 4, 2019

Will merge this PR soon as well. It exposes a primitive kernel op read_text that allows reading text in slices, which is especially useful for the tf.data API.
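
A hedged sketch of how such slice-based reads could be wired into tf.data, building on the pure-Python read_text stand-in above (`large_corpus.txt` and the 64 MB chunk size are made-up example values, and with the real C++ kernel the tf.py_function wrapper would not be needed):

```python
import os
import tensorflow as tf

filename = "large_corpus.txt"        # example file, assumed to exist
chunk = 64 * 1024 * 1024             # 64 MB splits, similar to an HDFS block
size = os.path.getsize(filename)
offsets = list(range(0, size, chunk))
lengths = [min(chunk, size - o) for o in offsets]

def load_split(offset, length):
    # One string vector of lines per (offset, length) split.
    lines = tf.py_function(
        func=lambda o, l: tf.constant(read_text(filename, int(o), int(l)),
                                      dtype=tf.string),
        inp=[offset, length],
        Tout=tf.string,
    )
    lines.set_shape([None])
    return lines

dataset = (
    tf.data.Dataset.from_tensor_slices((offsets, lengths))
    .map(load_split, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .unbatch()                       # flatten the chunks into a stream of lines
)
```

Reading several large splits in parallel (or interleaving them) is where the speedup over line-by-line reading is expected to come from.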

yongtang merged commit a8506f6 into tensorflow:master on Aug 4, 2019
yongtang deleted the read_text branch on August 4, 2019 18:53
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
* Add read_text to read lines from splittable text file

Signed-off-by: Yong Tang <[email protected]>

* Use read_text to implement TextDataset

Signed-off-by: Yong Tang <[email protected]>

* Fix python 3 failure

Signed-off-by: Yong Tang <[email protected]>
