Conversation

@yongtang
Member

@yongtang yongtang commented Jul 28, 2019

This PR is part of the discussion in #366 where we want to reduce the C++ implementation by reusing primitive ops for archive support.

The idea is to implement a C++ kernel that can be used generally, rather than adding additional C++ code for each Dataset.

Summary:

  1. list_archive_entries probes for the archive format and automatically returns the entry names in the archive.
  2. read_archive takes the output of list_archive_entries and returns a string tensor with the content of the entries in the archive.
  3. Both list_archive_entries and read_archive work in graph and eager mode, and could be wired up to implement an "ArchiveDataset" if needed (see the demo in tests/test_archive_eager.py).
  4. Combined with PR Rework on ParquetDataset for easy access and better cache size in eager mode #384, it is possible to read an archive into memory and decode Parquet-like files in memory, by utilizing SizedRandomAccessFile, which is a subclass of TensorFlow's RandomAccessFile.

Overall, we expect files compressed in an archive not to be too big, so they should at least fit into CPU memory. While files such as HDF5 or Parquet can be very large, compression normally happens at the file-format level, so an additional archive/compression layer around a huge file (e.g., hundreds of GBs) is rarely seen. The implementation in this PR should therefore be enough for normal needs.
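The two-op design above (probe and list entries first, then read the selected entries into memory) can be sketched in pure Python with the standard library. This is only an illustrative analogue of the pattern the C++ kernels implement, not the tensorflow_io API itself; the function names mirror the ops for clarity.

```python
# Illustrative analogue of the list_archive_entries / read_archive pattern:
# probe the archive format, list entry names, then read entry contents.
# Uses only the standard library; the real ops are C++ kernels.
import io
import tarfile
import zipfile


def list_archive_entries(data: bytes):
    """Probe the archive format and return (format, entry_names)."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return "zip", zf.namelist()
    try:
        with tarfile.open(fileobj=io.BytesIO(data)) as tar:
            return "tar", tar.getnames()
    except tarfile.ReadError:
        raise ValueError("unsupported archive format")


def read_archive(data: bytes, fmt: str, entries):
    """Return the content of the given entries as a list of byte strings."""
    if fmt == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return [zf.read(name) for name in entries]
    with tarfile.open(fileobj=io.BytesIO(data)) as tar:
        return [tar.extractfile(name).read() for name in entries]


# Build a small in-memory zip to demonstrate the round trip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b.txt", "world")

fmt, names = list_archive_entries(buf.getvalue())
contents = read_archive(buf.getvalue(), fmt, names)
print(fmt, names, contents)
```

Because the second op takes the first op's output as input, the same pair composes naturally into a dataset pipeline (the "ArchiveDataset" wiring mentioned above), where each element is one entry's content.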

Signed-off-by: Yong Tang [email protected]

@yongtang yongtang requested a review from terrytangyuan July 28, 2019 05:03
@yongtang yongtang changed the title Add read_archive and read_archive_entries support Add read_archive and list_archive_entries support Jul 28, 2019
@yongtang
Member Author

yongtang commented Aug 6, 2019

/cc @terrytangyuan

@terrytangyuan terrytangyuan merged commit b3e188a into tensorflow:master Aug 6, 2019
@yongtang yongtang deleted the archive branch August 6, 2019 04:12
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
* Add read_archive and read_archive_entries support

* Fix macOS build failure

Signed-off-by: Yong Tang <[email protected]>

* Fix 1.14 build failure

Signed-off-by: Yong Tang <[email protected]>

* Rename to list_archive_entries and read_archive

Signed-off-by: Yong Tang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants