add prototype datasets for MNIST and variants #4512
Conversation
ejguan
left a comment
I only have one concern for QMNIST as shown below
    LZMA = "lzma"

    class Decompressor(IterDataPipe[Tuple[str, io.IOBase]]):
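For context, the Decompressor referenced above maps (path, file) pairs to decompressed streams. A rough standalone sketch of the idea, using a plain generator instead of the actual IterDataPipe class (the `.xz` suffix check and function name are illustrative, not the PR's implementation):

```python
import io
import lzma
from typing import IO, Iterator, Tuple


def decompress(stream: Iterator[Tuple[str, IO[bytes]]]) -> Iterator[Tuple[str, IO[bytes]]]:
    """Yield (path, file) pairs, transparently decompressing .xz payloads."""
    for path, file in stream:
        if path.endswith(".xz"):
            yield path, io.BytesIO(lzma.decompress(file.read()))
        else:
            yield path, file


# Round-trip check: compress a payload, then run it through the pipe.
payload = b"fake MNIST bytes"
compressed = io.BytesIO(lzma.compress(payload))
(path, file), = decompress(iter([("train-images.xz", compressed)]))
restored = file.read()
assert restored == payload
```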
Thanks for adding it. We may take reference from it for TorchData in the future. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> We may take reference from it for TorchData in the future.

Please do so for every IterDataPipe in this file. I've only put them there when I thought they were general enough to be used by multiple datasets / torchdata.
    if config.split == "test10k":
        start = 0
        stop = 10000
    else:  # config.split == "test50k"
        start = 10000
        stop = None

    return Slicer(dp, start=start, stop=stop)
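The Slicer used here restricts iteration to an index range, which is how the 10k/50k QMNIST test splits are carved out of one file. A minimal standalone sketch, substituting `itertools.islice` for the actual datapipe:

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def slicer(dp: Iterable, *, start: Optional[int] = None, stop: Optional[int] = None) -> Iterator:
    """Lazily yield only the samples in [start, stop)."""
    yield from islice(dp, start or 0, stop)


# Toy version of the split logic: test10k takes the first N samples,
# test50k takes everything from index N on.
test10k = list(slicer(range(12), start=0, stop=10))
test50k = list(slicer(range(12), start=10, stop=None))
assert test10k == list(range(10))
assert test50k == [10, 11]
```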
Is there a way we can untangle test10k and test50k at the resource stage?
Unfortunately, no. The earliest point would be in the MNISTFileReader, which could enumerate the chunks and discard the unused ones directly. In fact, that would improve performance, since otherwise we also decode the chunks that we throw away later. I'll send a patch.
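Because MNIST-style files store fixed-size records, the reader can skip the unused range with a relative seek before decoding anything. A simplified sketch of that idea (names and layout are illustrative, not the actual MNISTFileReader):

```python
import io

ROWS, COLS = 28, 28
record_size = ROWS * COLS  # one uint8 image per record


def read_images(file: io.IOBase, start: int, stop: int):
    """Skip unused records with a relative seek instead of decoding them."""
    file.seek(start * record_size, 1)  # whence=1: relative to current position
    for _ in range(stop - start):
        yield file.read(record_size)


# Three fake 784-byte records; request only the middle one.
data = io.BytesIO(bytes(record_size) + b"\x01" * record_size + b"\x02" * record_size)
images = list(read_images(data, 1, 2))
assert len(images) == 1
assert images[0] == b"\x01" * record_size
```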
The latest commits include this change.
ejguan
left a comment
LGTM with one comment about FC for web stream
    start = self.start or 0
    stop = self.stop or num_samples

    file.seek(start * chunk_size, 1)
Could we also use a try-except here? For example, an HTTPResponse for a web stream is not seekable, so we would have to use read(bytes) to get around it.
I don't see the point of it. This datapipe is specifically for MNIST files, which will always be seekable. What scenario do you have in mind where we get the data from an HTTPResponse?
I was wondering if you want to support streaming the Dataset from remote in the future.
Let's talk about that in our sync.
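For the record, the try-except fallback suggested above could look roughly like this: prefer a relative seek, and fall back to reading and discarding bytes for non-seekable streams such as an HTTP response body (the function and class names here are hypothetical, not part of the PR):

```python
import io


def skip_bytes(file, n: int) -> None:
    """Skip n bytes: try a relative seek first, fall back to read()
    for non-seekable streams (e.g. a web response body)."""
    try:
        file.seek(n, 1)  # whence=1: relative to current position
    except (OSError, AttributeError):  # io.UnsupportedOperation is an OSError
        remaining = n
        while remaining > 0:
            chunk = file.read(min(remaining, 64 * 1024))
            if not chunk:  # stream exhausted early
                break
            remaining -= len(chunk)


class NonSeekableStream(io.RawIOBase):
    """Minimal stand-in for a non-seekable web stream."""

    def __init__(self, data: bytes) -> None:
        self._buffer = io.BytesIO(data)

    def seekable(self) -> bool:
        return False

    def seek(self, *args):
        raise io.UnsupportedOperation("seek")

    def read(self, size: int = -1) -> bytes:
        return self._buffer.read(size)


buf = io.BytesIO(b"abcdef")
skip_bytes(buf, 3)
seekable_rest = buf.read()
assert seekable_rest == b"def"

stream = NonSeekableStream(b"abcdef")
skip_bytes(stream, 3)
fallback_rest = stream.read()
assert fallback_rest == b"def"
```

Both paths end up at the same position, so downstream code does not need to know which kind of stream it received.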
Summary:

* add prototype datasets for MNIST and variants
* fix mypy
* fix EMNIST labels
* fix code format
* avoid encoding + decoding in every step
* discard data at the binary level instead of after decoding
* cleanup
* fix mypy

Reviewed By: NicolasHug

Differential Revision: D31505561

fbshipit-source-id: 7ac988ec660c9761edf51280640f318076e0ba75
cc @pmeier @mthrok @bjuncek