[WIP] Update Image to use variant for reference image objects #186
/cc @suphoff |
|
Some additional items:
|
61d2060 to 26032f6
Next step is to expose attributes for variant
Signed-off-by: Yong Tang <[email protected]>
|
With PR #184 merged, this PR has been rebased and all tests passed. /cc @terrytangyuan to take a look as well. |
    """

    def __init__(self, filename):
        """Create a ImageReader.
ImageReader -> ImageDataset
Signed-off-by: Yong Tang <[email protected]>
|
@suphoff Yes the idea is about right. There are several additional notes here:
As we discussed, the idea is to store metadata, and this metadata is not necessarily related to Dataset. But we also want to make sure the graph creation fits into the
|
@suphoff Adding You can take a look at the io/tensorflow_io/core/kernels/dataset_ops.h Lines 307 to 325 in 4019a0c
|
|
@suphoff One additional note about As text files are separated by lines and there is nothing else, the function above does not do anything. |
|
Let me change the PR to Work-in-Progress so that we can think this through further. There are several things we want to solve here:
A variant tensor is immutable, so maybe we could have a pass-through operation to explicitly read data, e.g., We can pass the content through (doing nothing) if the input has already been resolved, e.g.,
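The pass-through idea can be sketched in plain Python. This is only an illustration under assumptions, not TensorFlow code and not the API proposed in this PR: `FileReference` and `resolve` are hypothetical names. A reference is read on demand, while content that has already been resolved is returned unchanged.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)  # frozen: mimics the immutability of a variant tensor
class FileReference:
    """Hypothetical stand-in for an unresolved reference to a file."""
    filename: str


def resolve(value: Union[FileReference, bytes]) -> bytes:
    """Read the data explicitly if given a reference; otherwise pass through."""
    if isinstance(value, FileReference):
        with open(value.filename, "rb") as f:
            return f.read()
    # Already resolved: return the content untouched.
    return value


# Usage: write a small file, then resolve both a reference and raw content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"pixels")
content = resolve(FileReference(tmp.name))
same = resolve(content)  # pass-through, no second read
os.remove(tmp.name)
```

Because resolving is explicit, the point at which the data is actually read stays visible to the caller.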
|
@yongtang : Variants may be immutable, but you can wrap a pointer to a reference-counted C++ object into one, just like the Dataset implementation. As for distribution across host/device, I see two scenarios
I just don't see much use for enumerating files matched by a wildcard on one host and sending the filenames to a second host, but I admit I could be totally wrong here. A second issue, which I have not yet investigated, is how reference-counted objects wrapped in variant tensors interact with graph save/restore. Happy to discuss in a VC or on Gitter.
|
@suphoff It may not be a big issue for images, as images are likely small. But for other data formats the data could be huge, on the order of GBs (with only a part or a small chunk of the data truly used). In that situation, serializing GB-sized tensors and passing them around from one host is not efficient. A reference to the filename/entry will help with distribution when the data are not dense. The Dataset in TensorFlow was designed as an iterable or iterator, so it is not well suited to distributed systems. The distribution strategy helps to an extent, but it still only applies to dense datasets where every byte will be used, not to other formats where only a part or small chunks of the data are needed.
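To make the contrast concrete, here is a minimal plain-Python sketch of passing a small reference instead of the data itself. `EntryReference` and its fields are hypothetical and exist only for this illustration; it is not the design implemented in this PR. The reference is cheap to serialize, and only the requested byte range is read by whoever resolves it.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryReference:
    """Hypothetical lightweight reference: a filename plus a byte range."""
    filename: str
    offset: int = 0
    length: int = -1  # -1 means "read to the end of the file"

    def read(self) -> bytes:
        # Only the requested range is read, on whichever host calls this.
        with open(self.filename, "rb") as f:
            f.seek(self.offset)
            return f.read(self.length)


# Usage: a 10-byte file stands in for a multi-GB one.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
ref = EntryReference(tmp.name, offset=3, length=4)
chunk = ref.read()
os.remove(tmp.name)
```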
|
@yongtang : I agree serializing GBs of data would not be a good solution. |
|
@suphoff Eventually if each file (or archived object) is still too large, it is possible to split the file or object so that each host is only responsible for a chunk of the data. So host one processes |
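One way to picture the splitting is the sketch below, which divides a file of a given size into contiguous byte ranges, one per host. `split_ranges` is a hypothetical helper written for this illustration. It assumes the format can be cut at arbitrary byte offsets, which many real formats do not allow without aligning to record boundaries.

```python
from typing import List, Tuple


def split_ranges(total_size: int, num_hosts: int) -> List[Tuple[int, int]]:
    """Split [0, total_size) into num_hosts contiguous (offset, length) chunks."""
    if num_hosts <= 0:
        raise ValueError("num_hosts must be positive")
    base, extra = divmod(total_size, num_hosts)
    ranges = []
    offset = 0
    for host in range(num_hosts):
        # The first `extra` hosts take one additional byte each.
        length = base + (1 if host < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


# Usage: 10 bytes across 3 hosts.
ranges = split_ranges(10, 3)
```

Each host would then need only its own (offset, length) pair together with the filename, rather than the data itself.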