@yongtang
Member

@yongtang yongtang commented May 4, 2019

This PR updates LMDBDataset to support batching at creation time.
It also switches to the new pattern to reduce unnecessary code
duplication.

Signed-off-by: Yong Tang [email protected]

@yongtang yongtang requested a review from terrytangyuan May 4, 2019 15:37
@yongtang
Member Author

yongtang commented May 4, 2019

Also /cc @captain-pool

@captain-pool

This looks really helpful. Thanks @yongtang

@yongtang
Member Author

yongtang commented May 4, 2019

@captain-pool To explain further, with the new pattern, the only real method you need to implement to add a new data format is:

Status ReadRecord(
    io::InputStreamInterface* s,
    IteratorContext* ctx,
    std::unique_ptr<LMDBInputStream>& state,
    int64 record_to_read,
    int64* record_read,
    std::vector<Tensor>* out_tensors)
  • io::InputStreamInterface* s is the input stream for the file object. You can use ReadNBytes to read data from it. Keep in mind that some streams cannot be reset back to offset 0.
  • IteratorContext* ctx is needed to access the allocator when building tensors.
  • int64 record_to_read is the number of records (the batch size) requested; int64* record_read is the number of records actually returned after processing the data, which may be fewer than requested.
  • std::vector<Tensor>* out_tensors holds the output tensors. If you have multiple output tensors (e.g., one int and one string), append each of them to the vector.

ReadRecord is a stateful operation, so you need a place to store your state: the next ReadRecord call resumes where the previous one left off. In some cases the state can be as simple as an int offset. In others, such as LMDB, it is much more complicated.
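
To make the contract concrete, here is a minimal, self-contained sketch. It is not the tensorflow/io code and does not use TensorFlow: `FakeStream`, `Record`, and `kRecordSize` are hypothetical stand-ins for `io::InputStreamInterface`, `Tensor`, and a format's fixed record length, and a `bool` return replaces `Status`. The state is assumed to be a plain integer offset.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for io::InputStreamInterface: a forward-only byte source.
class FakeStream {
 public:
  explicit FakeStream(std::string data) : data_(std::move(data)) {}

  // Reads up to n bytes into *out; returns false once the stream is exhausted.
  bool ReadNBytes(int64_t n, std::string* out) {
    if (pos_ >= data_.size()) return false;
    *out = data_.substr(pos_, static_cast<size_t>(n));
    pos_ += out->size();
    return true;
  }

 private:
  std::string data_;
  size_t pos_ = 0;
};

using Record = std::string;          // stand-in for Tensor
constexpr int64_t kRecordSize = 4;   // assumed fixed record length

// Reads at most record_to_read records and reports the actual count in
// *record_read. The offset in `state` survives between calls, so each call
// continues where the previous one stopped. Returns false on a truncated record.
bool ReadRecord(FakeStream* s, std::unique_ptr<int64_t>& state,
                int64_t record_to_read, int64_t* record_read,
                std::vector<Record>* out_records) {
  if (!state) state.reset(new int64_t(0));  // first call: create the state
  *record_read = 0;
  std::string buf;
  while (*record_read < record_to_read && s->ReadNBytes(kRecordSize, &buf)) {
    if (static_cast<int64_t>(buf.size()) < kRecordSize) return false;
    out_records->push_back(buf);
    *state += kRecordSize;  // remember how many bytes have been consumed
    ++*record_read;
  }
  return true;
}
```

A caller would keep invoking `ReadRecord` with the same `state` until `*record_read` comes back as 0; a final batch smaller than `record_to_read` is expected when the data does not divide evenly.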

You can specify the class of the state type as a template parameter.

In the case of LMDB, I wrapped the state in an LMDBInputStream, hence the std::unique_ptr<LMDBInputStream>& state parameter.

In the case of CIFAR (https://github.com/tensorflow/io/blob/master/tensorflow_io/cifar/kernels/cifar_dataset_ops.cc),
the state needed is really just an int64 offset, hence the std::unique_ptr<int64>& state parameter.
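
The two state shapes mentioned above (a plain offset for CIFAR, a wrapped stream for LMDB) could be expressed with a base class templated on the state type. The following is only a hypothetical sketch of that idea, not the actual tensorflow/io base class: `RecordReader`, `OffsetReader`, `KvReader`, and `KvCursor` are invented names, `std::string` stands in for `Tensor`, and a `std::map` stands in for an LMDB database.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Generic driver: owns the state and hands it to the format-specific ReadRecord.
template <typename State>
class RecordReader {
 public:
  virtual ~RecordReader() = default;

  // Pulls up to `want` records into *out and returns how many were produced.
  int64_t Next(int64_t want, std::vector<std::string>* out) {
    int64_t got = 0;
    ReadRecord(state_, want, &got, out);
    return got;
  }

 protected:
  virtual void ReadRecord(std::unique_ptr<State>& state, int64_t record_to_read,
                          int64_t* record_read,
                          std::vector<std::string>* out) = 0;

 private:
  std::unique_ptr<State> state_;  // created lazily by the derived class
};

// Simple format: the state is just an offset into the rows.
class OffsetReader : public RecordReader<int64_t> {
 public:
  explicit OffsetReader(std::vector<std::string> rows) : rows_(std::move(rows)) {}

 protected:
  void ReadRecord(std::unique_ptr<int64_t>& state, int64_t record_to_read,
                  int64_t* record_read, std::vector<std::string>* out) override {
    if (!state) state.reset(new int64_t(0));
    while (*record_read < record_to_read &&
           *state < static_cast<int64_t>(rows_.size())) {
      out->push_back(rows_[static_cast<size_t>(*state)]);
      ++*state;
      ++*record_read;
    }
  }

 private:
  std::vector<std::string> rows_;
};

// Richer format: the state wraps a cursor over a key-value store.
struct KvCursor {
  std::map<std::string, std::string>::const_iterator it;
};

class KvReader : public RecordReader<KvCursor> {
 public:
  explicit KvReader(std::map<std::string, std::string> kv) : kv_(std::move(kv)) {}

 protected:
  void ReadRecord(std::unique_ptr<KvCursor>& state, int64_t record_to_read,
                  int64_t* record_read, std::vector<std::string>* out) override {
    if (!state) state.reset(new KvCursor{kv_.cbegin()});
    while (*record_read < record_to_read && state->it != kv_.cend()) {
      out->push_back(state->it->first + "=" + state->it->second);
      ++state->it;
      ++*record_read;
    }
  }

 private:
  std::map<std::string, std::string> kv_;
};
```

Under these assumptions, adding a new format means choosing a state type and implementing one method; the batching loop in `Next` is shared.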

Let me know if you have any questions.

@captain-pool

Will do, Thanks Yong!

```
Args:
- filenames: A `tf.string` tensor containing one or more filenames.
+ filename: A `tf.string` tensor containing one or more filenames.
```
Member

Is there a particular reason for this change?

Member Author

The input is slightly different from before: it is now a 1-D or 2-D tensor (not a list of tensors). I think it makes sense to drop the trailing "s", since a tensor is treated as singular throughout TensorFlow.

Member

This would break backward compatibility though. Should we add a deprecation notice or are we planning a non-minor release?

@yongtang
Member Author

yongtang commented May 4, 2019

@terrytangyuan Added a test case for LMDB with batch.

@terrytangyuan
Member

terrytangyuan commented May 5, 2019

Thanks. Just another comment on the arg name change. Travis build failed though.

@yongtang
Member Author

yongtang commented May 5, 2019

The build is failing because Bazel has no rule for aliasing a repository, and gRPC assumes zlib is named com_github_madler_zlib (not zlib).

I will need to update WORKSPACE and several BUILD files to fix this.

It feels like I spend more than half of my time fixing Bazel issues rather than on the actual implementation.

@yongtang
Member Author

yongtang commented May 5, 2019

@terrytangyuan With the upcoming TF 2.0, many APIs will change. In the v2 version of tf.data.Dataset, many public methods will be gone. I was trying to move forward with #195, but noticed that once we switch to v2, graph mode will not work in most cases.

At some point we will want to move to TF 2.0. I was thinking that after TF 1.14 is released, we could release a version for 1.14 and then switch. We may still have to wait until 1.14 is released.

@terrytangyuan
Member

@yongtang Thanks for the clarification. Should we rebase here now that the other PR on com_github_madler_zlib has been merged?

Member

@terrytangyuan terrytangyuan left a comment

LGTM

This PR updates LMDBDataset to support batching at creation time.
It also switches to the new pattern to reduce unnecessary code
duplication.

Signed-off-by: Yong Tang <[email protected]>

Add test case for LMDB with batch

Signed-off-by: Yong Tang <[email protected]>
@yongtang
Member Author

yongtang commented May 6, 2019

All tests pass now. I also changed the argument name back to filenames so that it stays the same as before.

@yongtang yongtang merged commit 1a400a2 into tensorflow:master May 6, 2019
@yongtang yongtang deleted the lmdb branch May 6, 2019 14:34
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
Update LMDBDataset to support batch at the creation