
Conversation

@yongtang (Member) commented Jul 27, 2019

This fix is part of the effort to improve the overall Dataset for easy access and better cache size in eager mode. See #382 and #366 for related discussions.

In order to be able to read a file either from a filename or from memory, this PR adds a SizedRandomAccessFile which allows providing an optional memory buffer as the file content. This could be useful when processing compressed or archived files, where we could just read the uncompressed file content into memory.
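As a rough Python analogy for what a SizedRandomAccessFile enables (an illustration of the idea only, not this PR's C++ API; the filename is made up): decompress once into memory, then serve random-access reads from the buffer as if it were a file.

```python
import gzip
import io

# Read the uncompressed content of an archive fully into memory.
with gzip.open("data.parquet.gz", "rb") as g:
    content = g.read()

# A file-like object backed by the in-memory buffer: readers can
# seek and read arbitrary ranges without touching the filesystem.
buf = io.BytesIO(content)
buf.seek(128)
chunk = buf.read(16)
```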

The previous limitation in Dataset was that a Dataset is an iterable, so the sequence length is unknown until graph runtime. In this PR, we provide a helper function to read the columns of a parquet file, so the length is known ahead of time.

This could also open other avenues, such as mapping a parquet file with `__getitem__` and `__len__`.
Further, a parquet file could be read into a Tensor and processed easily (with a pandas-like API).
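A minimal eager-mode sketch of that flow (the exact signatures of `list_parquet_columns` and `read_parquet` here, and the filename, are assumptions based on this description, not the final API):

```python
import tensorflow as tf
import tensorflow_io.parquet as parquet  # import path assumed for illustration

# Eager mode: discover the columns up front, so dtype, shape,
# and the sequence length are all known before building a pipeline.
columns = parquet.list_parquet_columns("stock.parquet")
for name, spec in columns.items():
    print(name, spec)

# Read one column in full into a Tensor and slice it, pandas-style.
price = parquet.read_parquet("stock.parquet", columns["price"])
first_ten = price[:10]
```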

The `list_parquet_columns` approach could be applied similarly to HDF5, where it is even more important: an HDF5 file could contain datasets with different sizes.

Summary:

  1. Two basic C++ kernel ops are implemented: `list_parquet_columns` and `read_parquet`.
  2. One `ParquetDataset` that is a Python-only implementation (no C++ anymore).
  3. `ParquetDataset` supports both eager and graph mode. In graph mode, dtype and shape
    are provided explicitly by the user; in eager mode, only the column name is needed
    (see the sketch after this list).
  4. `read_parquet` works in eager and graph mode, and can read records either in full or in slices.
  5. `list_parquet_columns` works in eager mode only (a limitation).
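A sketch of item 3 (argument names such as `dtype` and `shape` are taken from the summary above; the exact signature and filename are illustrative):

```python
import tensorflow as tf
import tensorflow_io.parquet as parquet  # import path assumed for illustration

# Eager mode: only the column names are needed; specs are inferred.
dataset = parquet.ParquetDataset("stock.parquet", ["price"])

# Graph mode: dtype and shape must be provided explicitly by the user.
dataset = parquet.ParquetDataset(
    "stock.parquet", ["price"],
    dtype=[tf.float64], shape=[tf.TensorShape([None])])
```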

For cache batch vs. batch in tf.keras:

  1. Added a hidden `capacity` argument to adjust the cache batch size.
  2. The batch size passed to tf.keras is unrelated to `capacity`, but we could use `rebatch`
    to change it at the end of the pipeline (see the sketch below).
  3. `capacity` could be padded so that `rebatch` only cuts a slice out of one chunk.
    If it is not padded to the tf.keras `batch_size`, then `rebatch` will likely copy across chunk boundaries.
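To make item 2 concrete with stock `tf.data` ops only (this PR's own `rebatch` mechanics may differ): `unbatch().batch()` expresses a rebatch at the end of the pipeline.

```python
import tensorflow as tf

# The pipeline produces cache-sized batches of `capacity` elements ...
capacity = 4096
ds = tf.data.Dataset.range(100000).batch(capacity)

# ... and is rebatched to the batch size tf.keras expects at the very end.
# If capacity is padded to a multiple of batch_size, every output batch is
# a slice of a single chunk instead of a copy across chunk boundaries.
batch_size = 32
ds = ds.unbatch().batch(batch_size)
```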

Signed-off-by: Yong Tang [email protected]

@yongtang (Member, Author) commented

/cc @terrytangyuan @BryanCutler @feihugis

/cc @CaptainDuke in case you are interested. I am thinking about applying a similar enhancement to HDF5 as well.

@CaptainDuke (Contributor) commented

Many thanks to @yongtang.

Yes, actually the contents of HDF5 files do not need to be decoded. Also, I'm working with HDF5 files containing datasets of different sizes. For example:

```
# h5ls test_data_level_6/10.hdf5
atk_diff       Dataset {5120, 1}
emy_vec_5      Dataset {5120, 429}
frame          Dataset {5120, 1}
global_info    Dataset {5120, 68}
hot_label      Dataset {5120, 1}
hot_weight     Dataset {5120, 1}
img_data       Dataset {5120, 5, 31, 31}
...
```
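For reference, the same per-dataset shapes can be inspected from Python with standard h5py calls (unrelated to this PR's ops):

```python
import h5py

# Print every dataset in the file with its shape and dtype,
# mirroring the h5ls output above.
with h5py.File("test_data_level_6/10.hdf5", "r") as f:
    for name in f:
        print(name, f[name].shape, f[name].dtype)
```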

I believe such an enhancement would be helpful.
BTW, is the bug in issue #342 related to this problem?

> The previous limitation in Dataset was that a Dataset is an iterable, so the sequence length is unknown until graph runtime. In this PR, we provide a helper function to read the specs of a parquet file, so the length is known.

@yongtang (Member, Author) commented

@CaptainDuke The issue #342 you are referring to might not be directly related to this problem. However, the recent changes in upstream tf.data (tensorflow/tensorflow@c5c1839) might make things complicated, as we will likely need to update the API pretty soon. With the ongoing rework of the cache size and the tf.io pipeline's interaction with tf.data, it might make sense to fix that together with this PR.

@yongtang yongtang merged commit 1642da1 into tensorflow:master Aug 5, 2019
@yongtang yongtang deleted the parquet branch August 5, 2019 21:04
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
Rework on ParquetDataset for easy access and better cache size in eager mode (tensorflow#384)

* Rework on ParquetDataset for easy access and better cache size in eager mode

Signed-off-by: Yong Tang <[email protected]>

* Fix build failures

Signed-off-by: Yong Tang <[email protected]>

* Rename read_parquet_columns => list_parquet_columns

Signed-off-by: Yong Tang <[email protected]>

* Remove batch args, and add test in graph mode

Signed-off-by: Yong Tang <[email protected]>