
Conversation

@yongtang
Member

This PR is part of the effort to enhance performance and ease of use for the tf.data pipeline, as was discussed in #382 and #366.

Previously, HDF5Dataset was relatively manual: the user had to find out the datasets (columns) in the HDF5 file themselves.

In this PR, the idea is to allow the user to call list_hdf5_datasets to probe the shape, dtype, and name of the datasets within an HDF5 file. A subsequent call to read_hdf5 brings the content into a shaped Tensor so that it can be used later in TensorFlow.
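The probe-then-read flow can be sketched in plain Python. The function names follow the ops added in this PR, but the dict-backed store below is just a hypothetical stand-in for a real HDF5 file, not the actual op implementation:

```python
# Minimal sketch of the probe-then-read pattern. A plain dict of lists
# plays the role of an HDF5 file; the real ops read from disk.

FAKE_FILE = {
    "/features": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    "/labels": [0, 1, 0],
}

def list_hdf5_datasets(store):
    """Return (name, dtype, shape) for each dataset, like the new op."""
    info = []
    for name, data in sorted(store.items()):
        if isinstance(data[0], list):
            shape = (len(data), len(data[0]))
            dtype = type(data[0][0]).__name__
        else:
            shape = (len(data),)
            dtype = type(data[0]).__name__
        info.append((name, dtype, shape))
    return info

def read_hdf5(store, name, start=0, stop=None):
    """Return the slice [start, stop) of one dataset, like the new op."""
    data = store[name]
    return data[start:len(data) if stop is None else stop]

for name, dtype, shape in list_hdf5_datasets(FAKE_FILE):
    print(name, dtype, shape)
print(read_hdf5(FAKE_FILE, "/labels", 1, 3))
```

The key point is the two-phase usage: probe once for metadata, then read only the content actually needed.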

read_hdf5 has the option to specify a slice (or a sub-block) of the dataset. This should open up the possibility in the future of binding a class to an HDF5 file by implementing __len__ and __getitem__.
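Such a binding could look roughly like the sketch below. This is hypothetical, not code from the PR: read_fn stands in for the read_hdf5 op, and the length would come from the shape probed via list_hdf5_datasets:

```python
# Hypothetical view class binding __len__ / __getitem__ to sliced reads,
# as the PR suggests for the future.

class HDF5DatasetView:
    def __init__(self, read_fn, length):
        self._read = read_fn   # callable(start, stop) -> list of rows
        self._length = length  # first dimension, from the probed shape

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, _ = index.indices(self._length)
            return self._read(start, stop)
        if not 0 <= index < self._length:
            raise IndexError(index)
        return self._read(index, index + 1)[0]

rows = [[i, i * i] for i in range(5)]      # stand-in for file contents
view = HDF5DatasetView(lambda a, b: rows[a:b], len(rows))
print(len(view), view[2], view[1:3])
```

With __len__ and __getitem__ in place, the object supports len(), indexing, and slicing without ever loading the whole dataset.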

With the list_hdf5_datasets and read_hdf5 ops, it is also possible to ease the use of HDF5Dataset in eager mode. In eager mode, HDF5Dataset could just call list_hdf5_datasets to find out all the necessary information, then call read_hdf5 in pieces to maintain the batch_size to be fed into tf.keras.
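Reading "in pieces" to maintain a batch size amounts to iterating over [start, stop) windows. A minimal sketch, again with read_fn standing in for the read_hdf5 op:

```python
# Sketch of chunked reads: yield fixed-size pieces so each chunk matches
# the desired batch_size (the last piece may be smaller).

def batched_reads(read_fn, total, batch_size):
    """Yield slices covering [0, total) in batch_size steps."""
    for start in range(0, total, batch_size):
        stop = min(start + batch_size, total)
        yield read_fn(start, stop)

data = list(range(10))                     # stand-in for one dataset
batches = list(batched_reads(lambda a, b: data[a:b], len(data), 4))
print(batches)  # three pieces: sizes 4, 4, 2
```

Each yielded piece could then be converted to a Tensor and fed to tf.keras without materializing the whole file in memory.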

The limitation is in graph mode, where the user still has to specify almost everything (dtype, shape, name) for HDF5Dataset to work.

This PR does not change the HDF5Dataset implementation to use the list_hdf5_datasets and read_hdf5 ops, but this could be done easily; see #384 for similar changes.

Signed-off-by: Yong Tang [email protected]

@yongtang
Member Author

/cc @CaptainDuke to take a look as well. I think list_hdf5_datasets and read_hdf5 can help ease the usage of HDF5 with TensorFlow 2.0 (with eager mode).

@yongtang
Member Author

yongtang commented Aug 4, 2019

@CaptainDuke I changed start:count to start:stop to match the suggestion in another PR:
#406 (comment)

Plan to merge this PR shortly.

@yongtang yongtang merged commit 16169c1 into tensorflow:master Aug 4, 2019
@yongtang yongtang deleted the hdf5 branch August 4, 2019 23:31
@yongtang
Member Author

yongtang commented Aug 5, 2019

/cc @terrytangyuan in case you want to take a look at read_hdf5, which allows you to specify a start and stop to cut a slice of the dataset into a Tensor.

@terrytangyuan
Member

@yongtang Thanks!

1 similar comment
@CaptainDuke
Contributor

@yongtang Thanks!

i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021

* Rework on HDF5: add list_hdf5_datasets and read_hdf5 ops


* Process default value of count and start

Signed-off-by: Yong Tang <[email protected]>

* Support HDF5Dataset in graph mode

Signed-off-by: Yong Tang <[email protected]>
