Rework on HDF5: add list_hdf5_datasets and read_hdf5 ops #392
Conversation
/cc @CaptainDuke to take a look as well.
@CaptainDuke I changed the Plan to merge this PR shortly.
/cc @terrytangyuan in case you want to take a look.
@yongtang Thanks!
Commits in this PR:
* Rework on HDF5: add list_hdf5_datasets and read_hdf5 ops
* Process default value of count and start
* Support HDF5Dataset in graph mode
This PR is part of the effort to enhance performance and ease of use for the tf.data pipeline, as discussed in #382 and #366.
Previously, HDF5Dataset was fairly manual: the user had to find out the datasets (columns) in the HDF5 file on their own.
In this PR, the idea is to let the user call list_hdf5_datasets to probe the shape, dtype, and name of the datasets within an HDF5 file. A subsequent call to read_hdf5 then reads the content into a shaped Tensor that can be used later in TensorFlow.
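A minimal sketch of that workflow (the `tensorflow_io.hdf5` module path, the dict-of-specs return value, and the exact `start`/`count` keyword names are assumptions for illustration, not necessarily the ops' final signatures):

```python
import tensorflow_io.hdf5 as hdf5_io  # module path is an assumption

# Probe the datasets (columns) stored in the HDF5 file without reading
# the data: each entry describes a dataset's name, dtype, and shape.
specs = hdf5_io.list_hdf5_datasets("sample.h5")  # return structure assumed
for name, spec in specs.items():
    print(name, spec.dtype, spec.shape)

# Read one dataset into a shaped Tensor. `start` and `count` select a
# slice (sub-block); omitting them reads the whole dataset.
value = hdf5_io.read_hdf5("sample.h5", "/my_dataset", start=0, count=100)
```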
read_hdf5 also has the option to specify a slice (or a sub-block) of the dataset. This should open up the possibility, in the future, of binding a class to an HDF5 file by implementing `__len__` and `__getitem__`.
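A hedged illustration of that binding idea: a small wrapper class maps `__len__` to the probed shape and `__getitem__` to sliced read_hdf5 calls (the class and all signatures below are hypothetical, not part of this PR):

```python
import tensorflow_io.hdf5 as hdf5_io  # module path is an assumption


class HDF5Column:
    """Hypothetical binding of a single HDF5 dataset to __len__/__getitem__."""

    def __init__(self, filename, dataset):
        self._filename = filename
        self._dataset = dataset
        # Probe once to learn dtype and shape; the first dimension is the length.
        spec = hdf5_io.list_hdf5_datasets(filename)[dataset]  # assumed return type
        self._length = spec.shape[0]

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        # Read a single record as a slice: start=index, count=1.
        return hdf5_io.read_hdf5(
            self._filename, self._dataset, start=index, count=1)
```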
With the list_hdf5_datasets and read_hdf5 ops it also becomes possible to simplify HDF5Dataset in eager mode: HDF5Dataset could just call list_hdf5_datasets to find out all the necessary information, then call read_hdf5 in pieces to maintain the `batch_size` to be fed into tf.keras.
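In eager mode that could look roughly like the following generator-based sketch, which reads the file in `batch_size` pieces and feeds them to tf.keras through tf.data (again, the op signatures and return values are assumptions):

```python
import tensorflow as tf
import tensorflow_io.hdf5 as hdf5_io  # module path is an assumption


def hdf5_batches(filename, dataset, batch_size):
    # Probe once for shape/dtype, then read the data in batch-sized pieces.
    spec = hdf5_io.list_hdf5_datasets(filename)[dataset]  # assumed return type
    total = spec.shape[0]
    for start in range(0, total, batch_size):
        count = min(batch_size, total - start)
        yield hdf5_io.read_hdf5(filename, dataset, start=start, count=count)


# The generator can then back a tf.data pipeline for model.fit().
ds = tf.data.Dataset.from_generator(
    lambda: hdf5_batches("sample.h5", "/my_dataset", 32),
    output_types=tf.float64)
```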
The limitation is graph mode, where the user still has to specify almost everything (dtype, shape, name) for HDF5Dataset to work.
This PR does not change the HDF5Dataset implementation to use the list_hdf5_datasets and read_hdf5 ops, but that could easily be done; see #384 for similar changes.
Signed-off-by: Yong Tang <[email protected]>