add prototype datasets for MNIST and variants #4512
Conversation
ejguan
left a comment
I only have one concern for QMNIST as shown below
    LZMA = "lzma"

    class Decompressor(IterDataPipe[Tuple[str, io.IOBase]]):
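For context, the Decompressor referenced above maps (path, file) pairs to decompressed streams. A rough standalone sketch of the idea, using a plain generator instead of the actual IterDataPipe class (the `.xz` suffix check and function name are illustrative, not the PR's implementation):

```python
import io
import lzma
from typing import IO, Iterator, Tuple


def decompress(stream: Iterator[Tuple[str, IO[bytes]]]) -> Iterator[Tuple[str, IO[bytes]]]:
    """Yield (path, file) pairs, transparently decompressing .xz payloads."""
    for path, file in stream:
        if path.endswith(".xz"):
            yield path, io.BytesIO(lzma.decompress(file.read()))
        else:
            yield path, file


# Round-trip check: compress a payload, then run it through the pipe.
payload = b"fake MNIST bytes"
compressed = io.BytesIO(lzma.compress(payload))
(path, file), = decompress(iter([("train-images.xz", compressed)]))
restored = file.read()
assert restored == payload
```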
Thanks for adding it. We may take reference from it for TorchData in the future. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> We may take reference from it for TorchData in the future.

Please do so for every IterDataPipe in this file. I've only put them there when I thought they were general enough to be used by multiple datasets / torchdata.
    if config.split == "test10k":
        start = 0
        stop = 10000
    else:  # config.split == "test50k"
        start = 10000
        stop = None

    return Slicer(dp, start=start, stop=stop)
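The Slicer used here restricts iteration to an index range, which is how the 10k/50k QMNIST test splits are carved out of one file. A minimal standalone sketch, substituting `itertools.islice` for the actual datapipe:

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def slicer(dp: Iterable, *, start: Optional[int] = None, stop: Optional[int] = None) -> Iterator:
    """Lazily yield only the samples in [start, stop)."""
    yield from islice(dp, start or 0, stop)


# Toy version of the split logic: test10k takes the first N samples,
# test50k takes everything from index N on.
test10k = list(slicer(range(12), start=0, stop=10))
test50k = list(slicer(range(12), start=10, stop=None))
assert test10k == list(range(10))
assert test50k == [10, 11]
```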
Is there a way we can untangle test10k and test50k at the resource stage?
Unfortunately, no. The earliest point would be in the MNISTFileReader, which could enumerate the chunks and discard the unused ones directly. In fact, that would improve performance, since otherwise we also decode the chunks that we throw away later. I'll send a patch.
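Because MNIST-style files store fixed-size records, the reader can skip the unused range with a relative seek before decoding anything. A simplified sketch of that idea (names and layout are illustrative, not the actual MNISTFileReader):

```python
import io

ROWS, COLS = 28, 28
record_size = ROWS * COLS  # one uint8 image per record


def read_images(file: io.IOBase, start: int, stop: int):
    """Skip unused records with a relative seek instead of decoding them."""
    file.seek(start * record_size, 1)  # whence=1: relative to current position
    for _ in range(stop - start):
        yield file.read(record_size)


# Three fake 784-byte records; request only the middle one.
data = io.BytesIO(bytes(record_size) + b"\x01" * record_size + b"\x02" * record_size)
images = list(read_images(data, 1, 2))
assert len(images) == 1
assert images[0] == b"\x01" * record_size
```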
The latest commits include this change.
ejguan
left a comment
LGTM with one comment about FC for web stream
    start = self.start or 0
    stop = self.stop or num_samples

    file.seek(start * chunk_size, 1)
Could we also use a try-except here? For example, an HTTPResponse for a web stream is not seekable, so we would have to use read(bytes) to get around it.
I don't see the point of it. This datapipe is specifically for MNIST files, which will always be seekable. What scenario do you have in mind where we get the data from an HTTPResponse?
I was wondering if you want to support streaming the Dataset from remote in the future.
Let's talk about that in our sync.
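For the record, the try-except fallback suggested above could look roughly like this: prefer a relative seek, and fall back to reading and discarding bytes for non-seekable streams such as an HTTP response body (the function and class names here are hypothetical, not part of the PR):

```python
import io


def skip_bytes(file, n: int) -> None:
    """Skip n bytes: try a relative seek first, fall back to read()
    for non-seekable streams (e.g. a web response body)."""
    try:
        file.seek(n, 1)  # whence=1: relative to current position
    except (OSError, AttributeError):  # io.UnsupportedOperation is an OSError
        remaining = n
        while remaining > 0:
            chunk = file.read(min(remaining, 64 * 1024))
            if not chunk:  # stream exhausted early
                break
            remaining -= len(chunk)


class NonSeekableStream(io.RawIOBase):
    """Minimal stand-in for a non-seekable web stream."""

    def __init__(self, data: bytes) -> None:
        self._buffer = io.BytesIO(data)

    def seekable(self) -> bool:
        return False

    def seek(self, *args):
        raise io.UnsupportedOperation("seek")

    def read(self, size: int = -1) -> bytes:
        return self._buffer.read(size)


buf = io.BytesIO(b"abcdef")
skip_bytes(buf, 3)
seekable_rest = buf.read()
assert seekable_rest == b"def"

stream = NonSeekableStream(b"abcdef")
skip_bytes(stream, 3)
fallback_rest = stream.read()
assert fallback_rest == b"def"
```

Both paths end up at the same position, so downstream code does not need to know which kind of stream it received.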
Summary:

* add prototype datasets for MNIST and variants
* fix mypy
* fix EMNIST labels
* fix code format
* avoid encoding + decoding in every step
* discard data at the binary level instead of after decoding
* cleanup
* fix mypy

Reviewed By: NicolasHug

Differential Revision: D31505561

fbshipit-source-id: 7ac988ec660c9761edf51280640f318076e0ba75
cc @pmeier @mthrok @bjuncek