Update LMDBDataset to support batch at the creation #213

yongtang · 2019-05-04T15:37:09Z

This PR updates LMDBDataset to support batching at creation time.
It also switches to the new pattern to reduce unnecessary code
duplication.

Signed-off-by: Yong Tang [email protected]

yongtang · 2019-05-04T15:37:41Z

Also /cc @captain-pool

captain-pool · 2019-05-04T15:42:57Z

This looks really helpful. Thanks @yongtang

yongtang · 2019-05-04T15:54:20Z

@captain-pool To explain further, with the new pattern, the only real method you need to implement to add a new data format is:

Status ReadRecord(
    io::InputStreamInterface* s,
    IteratorContext* ctx,
    std::unique_ptr<LMDBInputStream>& state,
    int64 record_to_read,
    int64* record_read,
    std::vector<Tensor>* out_tensors)

io::InputStreamInterface* s is the input stream you get (for the file object). You can use ReadNBytes to read data from it. Keep in mind that some streams may not be able to reset back to offset 0.
IteratorContext* ctx is needed because you need ctx to access the allocator in order to build tensors.
int64 record_to_read is the number of records (the batch size) requested. int64* record_read is the number of records you actually returned after processing the data (you may return fewer than requested).
std::vector<Tensor>* out_tensors holds the output tensors. If you have multiple output tensors (e.g., one int and one string), append each of them to the vector.

ReadRecord is a stateful operation, so you need a place to store your state. The next ReadRecord call will start from where the last one stopped. In some cases the state can be simply an 'int offset'. In other cases, such as LMDB, the state is much more complicated.
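
As a rough illustration of the stateful contract described above, here is a minimal, self-contained C++ sketch. It does not use TensorFlow: MemoryStream, kRecordSize, and the bool return value are hypothetical stand-ins for io::InputStreamInterface, the record format, and Status, and plain strings stand in for tensors. IteratorContext is left out because nothing here needs an allocator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for io::InputStreamInterface: an in-memory buffer
// that only supports forward reads (it cannot be reset to offset 0).
class MemoryStream {
 public:
  explicit MemoryStream(std::string data) : data_(std::move(data)) {}

  // Reads up to n bytes from the current position and advances past them.
  std::string ReadNBytes(std::size_t n) {
    const std::size_t count = std::min(n, data_.size() - pos_);
    std::string out = data_.substr(pos_, count);
    pos_ += count;
    return out;
  }

 private:
  std::string data_;
  std::size_t pos_ = 0;
};

constexpr std::size_t kRecordSize = 4;  // Assumed fixed record width.

// Reads up to record_to_read records and appends them to out_records.
// The state is created on the first call and counts the records consumed so
// far, so the next call resumes where this one stopped. The returned bool is
// a placeholder for a Status value.
bool ReadRecord(MemoryStream* s, std::unique_ptr<int64_t>& state,
                int64_t record_to_read, int64_t* record_read,
                std::vector<std::string>* out_records) {
  assert(s != nullptr && record_read != nullptr && out_records != nullptr);
  if (state == nullptr) {
    state = std::make_unique<int64_t>(0);  // First call: nothing read yet.
  }
  *record_read = 0;
  while (*record_read < record_to_read) {
    std::string record = s->ReadNBytes(kRecordSize);
    if (record.size() < kRecordSize) break;  // End of stream or short tail.
    out_records->push_back(std::move(record));
    ++(*record_read);
    ++(*state);
  }
  return true;
}
```

A caller would keep passing the same state object and stop once record_read comes back smaller than the requested batch size.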

You can define the class of the state type as a template parameter.

In the case of LMDB, I wrapped the state into an LMDBInputStream, so there is a std::unique_ptr<LMDBInputStream>& state.

In the case of CIFAR (https://github.com/tensorflow/io/blob/master/tensorflow_io/cifar/kernels/cifar_dataset_ops.cc),
the needed state is really just an int64 offset, so there is a std::unique_ptr<int64>& state.
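
To make the "state type as a template parameter" idea more concrete, here is a small standalone C++ sketch. It is not the tensorflow_io base class: BatchReader, OffsetReader, CursorReader, CursorState, and KeyValueStore are hypothetical names, a std::map stands in for an LMDB database, and strings stand in for tensors. It only shows how one driver loop could serve both a trivial offset state and a richer cursor state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: State is a template parameter. The base owns the
// state object and hands it to every ReadRecord call of the derived class.
template <typename Derived, typename State>
class BatchReader {
 public:
  // Calls ReadRecord repeatedly until it reports that no records were read.
  std::vector<std::string> ReadAll(int64_t batch_size) {
    assert(batch_size > 0);
    std::vector<std::string> all;
    int64_t record_read = 0;
    do {
      record_read = 0;
      static_cast<Derived*>(this)->ReadRecord(state_, batch_size,
                                              &record_read, &all);
    } while (record_read > 0);
    return all;
  }

 private:
  std::unique_ptr<State> state_;
};

// Simple case: the state is just an offset into a list of records.
class OffsetReader : public BatchReader<OffsetReader, int64_t> {
 public:
  explicit OffsetReader(std::vector<std::string> records)
      : records_(std::move(records)) {}

  void ReadRecord(std::unique_ptr<int64_t>& state, int64_t record_to_read,
                  int64_t* record_read, std::vector<std::string>* out) {
    if (!state) state = std::make_unique<int64_t>(0);
    const int64_t total = static_cast<int64_t>(records_.size());
    while (*record_read < record_to_read && *state < total) {
      out->push_back(records_[static_cast<std::size_t>(*state)]);
      ++(*state);
      ++(*record_read);
    }
  }

 private:
  std::vector<std::string> records_;
};

// Richer case: the state wraps a cursor over a key-value store.
using KeyValueStore = std::map<std::string, std::string>;
struct CursorState {
  KeyValueStore::const_iterator position;
};

class CursorReader : public BatchReader<CursorReader, CursorState> {
 public:
  explicit CursorReader(KeyValueStore store) : store_(std::move(store)) {}

  void ReadRecord(std::unique_ptr<CursorState>& state, int64_t record_to_read,
                  int64_t* record_read, std::vector<std::string>* out) {
    if (!state) {
      state = std::make_unique<CursorState>(CursorState{store_.cbegin()});
    }
    while (*record_read < record_to_read &&
           state->position != store_.cend()) {
      out->push_back(state->position->first + "=" + state->position->second);
      ++state->position;
      ++(*record_read);
    }
  }

 private:
  KeyValueStore store_;
};
```

In this sketch, adding another format means writing one more derived class with its own state type and ReadRecord, while the batching loop stays shared.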

Let me know if you have any questions.

captain-pool · 2019-05-04T16:01:15Z

Will do, Thanks Yong!

terrytangyuan · 2019-05-04T17:45:55Z

tensorflow_io/lmdb/python/ops/lmdb_ops.py

```diff
     Args:
-      filenames: A `tf.string` tensor containing one or more filenames.
+      filename: A `tf.string` tensor containing one or more filenames.
```


Is there a particular reason for this change?

The input is slightly different than before: it is now a 1D or 2D tensor (not a list of tensors). I think it makes sense to drop the trailing `s`, since a tensor is treated as singular throughout TensorFlow.

This would break backward compatibility though. Should we add a deprecation notice or are we planning a non-minor release?


yongtang · 2019-05-04T19:53:19Z

@terrytangyuan Added a test case for LMDB with batch.

terrytangyuan · 2019-05-05T01:00:12Z

Thanks. Just another comment on the arg name change. Travis build failed though.

yongtang · 2019-05-05T16:57:39Z

The build is failing because Bazel does not have a rule to alias a repository, and gRPC assumes zlib is named com_github_madler_zlib (not zlib).

I will need to update the WORKSPACE and several BUILD files to fix this.

It feels like more than half of my time goes to fixing Bazel issues rather than the actual implementation.

yongtang · 2019-05-05T17:21:42Z

@terrytangyuan With the upcoming TF 2.0, a lot of APIs will change. With the v2 version of tf.data.Dataset, many public methods will be gone. I was trying to move forward with #195, but noticed that once we switch to v2, graph mode will not work in most cases.

We are coming to the point where we want to move to TF 2.0. I was thinking that after TF 1.14 is released we could release a version for 1.14, then switch. We may still have to wait until 1.14 is released.

terrytangyuan · 2019-05-06T00:43:38Z

@yongtang Thanks for the clarification. Should we rebase here now that the other PR on com_github_madler_zlib has been merged?

terrytangyuan

LGTM

This PR updates LMDBDataset to support batch at the creation, it also switches to the new pattern to reduce unnecessary code duplication.
Signed-off-by: Yong Tang <[email protected]>

Add test case for LMDB with batch
Signed-off-by: Yong Tang <[email protected]>

yongtang · 2019-05-06T14:34:44Z

All tests pass now. I also changed the argument name back to `filenames`, so it is the same as before.


yongtang requested a review from terrytangyuan May 4, 2019 15:37

terrytangyuan reviewed May 4, 2019


yongtang force-pushed the lmdb branch from dcd4ed9 to 55da31d Compare May 4, 2019 21:46

yongtang force-pushed the lmdb branch from 55da31d to 148ca7d Compare May 5, 2019 03:08

terrytangyuan approved these changes May 6, 2019


yongtang force-pushed the lmdb branch from 148ca7d to c370044 Compare May 6, 2019 05:09

yongtang merged commit 1a400a2 into tensorflow:master May 6, 2019

yongtang deleted the lmdb branch May 6, 2019 14:34

i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021

Merge pull request tensorflow#213 from yongtang/lmdb

be79c3e

Update LMDBDataset to support batch at the creation
