Conversation

@yongtang
Member

@yongtang yongtang commented Jul 28, 2019

This PR is part of the discussion in #366 where we want to reduce the C++ implementation by reusing primitive ops for archive support.

The idea is to implement a C++ kernel that can be used generally, rather than adding additional C++ code for each Dataset.

Summary:

  1. list_archive_entries probes for the archive format and automatically returns the entry names in the archive.
  2. read_archive takes the output of list_archive_entries and returns a string tensor with the content of the entries in the archive.
  3. Both list_archive_entries and read_archive work in graph and eager mode, and could be wired up to implement an "ArchiveDataset" if needed (see the demo in tests/test_archive_eager.py).
  4. Combined with PR Rework on ParquetDataset for easy access and better cache size in eager mode #384, it is possible to read an archive into memory and decode Parquet-like files in memory, by utilizing SizedRandomAccessFile, which is a subclass of TensorFlow's RandomAccessFile.

Overall, we expect files compressed in an archive not to be too big, so they should at least fit into CPU memory. While files such as HDF5 or Parquet can be very large, compression normally happens at the file-format level, so an additional archive/compression layer around a huge file (e.g., hundreds of GBs) is rarely seen. The implementation in this PR should therefore be enough for normal needs.
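The two-op design above (probe and list entries first, then read the selected entries into memory) can be sketched in pure Python with the standard library. This is only an illustrative analogue of the pattern the C++ kernels implement, not the tensorflow_io API itself; the function names mirror the ops for clarity.

```python
# Illustrative analogue of the list_archive_entries / read_archive pattern:
# probe the archive format, list entry names, then read entry contents.
# Uses only the standard library; the real ops are C++ kernels.
import io
import tarfile
import zipfile


def list_archive_entries(data: bytes):
    """Probe the archive format and return (format, entry_names)."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return "zip", zf.namelist()
    try:
        with tarfile.open(fileobj=io.BytesIO(data)) as tar:
            return "tar", tar.getnames()
    except tarfile.ReadError:
        raise ValueError("unsupported archive format")


def read_archive(data: bytes, fmt: str, entries):
    """Return the content of the given entries as a list of byte strings."""
    if fmt == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return [zf.read(name) for name in entries]
    with tarfile.open(fileobj=io.BytesIO(data)) as tar:
        return [tar.extractfile(name).read() for name in entries]


# Build a small in-memory zip to demonstrate the round trip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b.txt", "world")

fmt, names = list_archive_entries(buf.getvalue())
contents = read_archive(buf.getvalue(), fmt, names)
print(fmt, names, contents)
```

Because the second op takes the first op's output as input, the same pair composes naturally into a dataset pipeline (the "ArchiveDataset" wiring mentioned above), where each element is one entry's content.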

Signed-off-by: Yong Tang [email protected]

@yongtang yongtang requested a review from terrytangyuan July 28, 2019 05:03
@yongtang yongtang changed the title Add read_archive and read_archive_entries support Add read_archive and list_archive_entries support Jul 28, 2019
@yongtang
Member Author

yongtang commented Aug 6, 2019

/cc @terrytangyuan

@terrytangyuan terrytangyuan merged commit b3e188a into tensorflow:master Aug 6, 2019
@yongtang yongtang deleted the archive branch August 6, 2019 04:12
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
* Add read_archive and read_archive_entries support

* Fix macOS build failure

Signed-off-by: Yong Tang <[email protected]>

* Fix 1.14 build failure

Signed-off-by: Yong Tang <[email protected]>

* Rename to list_archive_entries and read_archive

Signed-off-by: Yong Tang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants