@yongtang (Member) commented on Aug 24, 2019

HDF5 is a widely used file format. It stores data in named datasets, each of which is an array with its own shape. It is not exactly columnar, because different datasets in an HDF5 file can have different, unrelated shapes. From that standpoint it is more like storage for a collection of tensors, where each dataset represents one tensor.

HDF5 supports slicing and indexing natively. In fact, its selection model (for example, strided hyperslab selections) is more flexible than what many other formats offer.
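To make the slicing point concrete, here is a minimal sketch of how a Python slice could be translated into the (start, stride, count) triple that HDF5 hyperslab selections are described with. The helper names `slice_to_hyperslab` and `select` are hypothetical and only for illustration. This is not the code in this PR, and it assumes positive strides only.

```python
def slice_to_hyperslab(s, length):
    """Translate a Python slice over a dimension of size `length`
    into a (start, stride, count) triple.

    Hypothetical helper for illustration; only positive strides are handled.
    """
    # slice.indices() clamps the slice to the dimension and resolves
    # negative or missing bounds.
    start, stop, step = s.indices(length)
    if step <= 0:
        raise ValueError("only positive strides are supported in this sketch")
    # Number of elements the slice actually selects.
    count = len(range(start, stop, step))
    return start, step, count


def select(shape, key):
    """Return one (start, stride, count) triple per dimension."""
    if len(key) != len(shape):
        raise ValueError("expected one slice per dimension")
    return [slice_to_hyperslab(s, n) for s, n in zip(key, shape)]


# Every other row, and columns 1..2, of a hypothetical 10x4 dataset.
selection = select((10, 4), (slice(0, 10, 2), slice(1, 3)))
```

Only the selected region has to be read from the file, which is what makes slicing a large dataset cheap.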

This PR adds tfio.IOTensor.from_hdf5. It treats an HDF5 file as a collection of BaseIOTensor objects, each of which can then be sliced and indexed.

Note that the collection here is essentially a dictionary that maps each key to a BaseIOTensor. This differs from columnar IOTensors such as those for Parquet or Avro.
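The difference between a keyed collection and a columnar layout can be sketched in plain Python. The `LazyTensor` class, the dataset names, and `is_columnar` below are hypothetical stand-ins, not the tfio implementation. They only illustrate that each entry keeps its own shape and can be sliced independently, whereas the columns of a columnar table share a row count.

```python
class LazyTensor:
    """Hypothetical stand-in for a per-dataset tensor.

    It holds a 2-D block as nested lists and returns only the rows requested.
    """

    def __init__(self, rows):
        self._rows = rows

    @property
    def shape(self):
        ncols = len(self._rows[0]) if self._rows else 0
        return (len(self._rows), ncols)

    def __getitem__(self, key):
        # Slicing is delegated to the underlying row storage.
        return self._rows[key]


# A collection is just a dictionary: dataset name -> tensor-like object.
collection = {
    "/images": LazyTensor([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]),
    "/labels": LazyTensor([[1], [0]]),
}


def is_columnar(shapes):
    """True when every entry has the same number of rows."""
    return len({shape[0] for shape in shapes}) <= 1


shapes = {name: tensor.shape for name, tensor in collection.items()}
middle_rows = collection["/images"][1:3]      # slice one entry on its own
columnar = is_columnar(shapes.values())       # row counts differ here
```

Because the entries do not share a row count, the collection cannot be iterated row by row the way a columnar table can. Each entry is addressed by key first and sliced afterwards.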

Signed-off-by: Yong Tang [email protected]

@yongtang (Member, Author)

I plan to merge this PR soon: all tests now pass, and the change has little impact on the columnar datasets such as Parquet and Avro.

@yongtang merged commit ff56c89 into tensorflow:master on Sep 11, 2019
@yongtang deleted the hdf5_io_tensor branch on September 11, 2019 at 18:24
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021