
Conversation

@yongtang
Member

This PR is part of the effort to enhance performance and ease of use for the tf.data pipeline, as was discussed in #382 and #366.

Previously, HDF5Dataset was relatively manual: the user had to find out the datasets (columns) in the HDF5 file themselves.

In this PR, the idea is to allow the user to call list_hdf5_datasets to probe the shape, dtype, and name of the datasets within an HDF5 file. A subsequent call to read_hdf5 brings the content into a shaped Tensor so that it can be used later in TensorFlow.
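The probe-then-read flow can be sketched in plain Python. The function names follow the ops added in this PR, but the dict-backed store below is just a hypothetical stand-in for a real HDF5 file, not the actual op implementation:

```python
# Minimal sketch of the probe-then-read pattern. A plain dict of lists
# plays the role of an HDF5 file; the real ops read from disk.

FAKE_FILE = {
    "/features": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    "/labels": [0, 1, 0],
}

def list_hdf5_datasets(store):
    """Return (name, dtype, shape) for each dataset, like the new op."""
    info = []
    for name, data in sorted(store.items()):
        if isinstance(data[0], list):
            shape = (len(data), len(data[0]))
            dtype = type(data[0][0]).__name__
        else:
            shape = (len(data),)
            dtype = type(data[0]).__name__
        info.append((name, dtype, shape))
    return info

def read_hdf5(store, name, start=0, stop=None):
    """Return the slice [start, stop) of one dataset, like the new op."""
    data = store[name]
    return data[start:len(data) if stop is None else stop]

for name, dtype, shape in list_hdf5_datasets(FAKE_FILE):
    print(name, dtype, shape)
print(read_hdf5(FAKE_FILE, "/labels", 1, 3))
```

The key point is the two-phase usage: probe once for metadata, then read only the content actually needed.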

read_hdf5 has the option to specify a slice (or a sub-block) of the dataset. This should open up the possibility in the future of binding a class to an HDF5 file by implementing __len__ and __getitem__.
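Such a binding could look roughly like the sketch below. This is hypothetical, not code from the PR: read_fn stands in for the read_hdf5 op, and the length would come from the shape probed via list_hdf5_datasets:

```python
# Hypothetical view class binding __len__ / __getitem__ to sliced reads,
# as the PR suggests for the future.

class HDF5DatasetView:
    def __init__(self, read_fn, length):
        self._read = read_fn   # callable(start, stop) -> list of rows
        self._length = length  # first dimension, from the probed shape

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, _ = index.indices(self._length)
            return self._read(start, stop)
        if not 0 <= index < self._length:
            raise IndexError(index)
        return self._read(index, index + 1)[0]

rows = [[i, i * i] for i in range(5)]      # stand-in for file contents
view = HDF5DatasetView(lambda a, b: rows[a:b], len(rows))
print(len(view), view[2], view[1:3])
```

With __len__ and __getitem__ in place, the object supports len(), indexing, and slicing without ever loading the whole dataset.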

With the list_hdf5_datasets and read_hdf5 ops, it is also possible to ease the use of HDF5Dataset in eager mode. In eager mode, HDF5Dataset could just call list_hdf5_datasets to find out all the necessary information, then call read_hdf5 in pieces to maintain the batch_size to be fed into tf.keras.
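Reading "in pieces" to maintain a batch size amounts to iterating over [start, stop) windows. A minimal sketch, again with read_fn standing in for the read_hdf5 op:

```python
# Sketch of chunked reads: yield fixed-size pieces so each chunk matches
# the desired batch_size (the last piece may be smaller).

def batched_reads(read_fn, total, batch_size):
    """Yield slices covering [0, total) in batch_size steps."""
    for start in range(0, total, batch_size):
        stop = min(start + batch_size, total)
        yield read_fn(start, stop)

data = list(range(10))                     # stand-in for one dataset
batches = list(batched_reads(lambda a, b: data[a:b], len(data), 4))
print(batches)  # three pieces: sizes 4, 4, 2
```

Each yielded piece could then be converted to a Tensor and fed to tf.keras without materializing the whole file in memory.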

The limitation is in graph mode, where the user still has to specify almost everything (dtype, shape, name) for HDF5Dataset to work.

This PR does not change the HDF5Dataset implementation to use the list_hdf5_datasets and read_hdf5 ops, but this could be done easily; see #384 for similar changes.

Signed-off-by: Yong Tang [email protected]

@yongtang
Member Author

/cc @CaptainDuke to take a look as well. I think list_hdf5_datasets and read_hdf5 can help ease the usage of HDF5 with TensorFlow 2.0 (with eager mode).

@yongtang
Member Author

yongtang commented Aug 4, 2019

@CaptainDuke I changed start:count to start:stop to match the suggestion in another PR:
#406 (comment)

Plan to merge this PR shortly.

@yongtang yongtang merged commit 16169c1 into tensorflow:master Aug 4, 2019
@yongtang yongtang deleted the hdf5 branch August 4, 2019 23:31
@yongtang
Member Author

yongtang commented Aug 5, 2019

/cc @terrytangyuan in case you want to take a look at read_hdf5, which allows you to specify a start and stop to cut a slice of the dataset into a Tensor.

@terrytangyuan
Member

@yongtang Thanks!

1 similar comment
@CaptainDuke
Contributor

@yongtang Thanks!

i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021

* Rework on HDF5: add list_hdf5_datasets and read_hdf5 ops


* Process default value of count and start

Signed-off-by: Yong Tang <[email protected]>

* Support HDF5Dataset in graph mode

Signed-off-by: Yong Tang <[email protected]>
