
Conversation

@pmeier pmeier commented Sep 30, 2021

@ejguan ejguan left a comment

I only have one concern for QMNIST, as shown below.

LZMA = "lzma"


class Decompressor(IterDataPipe[Tuple[str, io.IOBase]]):

@ejguan commented:

Thanks for adding it. We may use it as a reference for TorchData in the future. 😄

@pmeier (author) replied:

> We may use it as a reference for TorchData in the future.

Please do so for every IterDataPipe in this file. I've only put them there when I thought they were general enough to be used by multiple datasets / torchdata.
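
For context, a minimal sketch of what such a Decompressor datapipe can look like. The suffix-based dispatch and the exact signatures here are illustrative assumptions, not the PR's actual code:

import gzip
import io
import lzma
from typing import Iterator, Tuple

from torch.utils.data import IterDataPipe


class Decompressor(IterDataPipe[Tuple[str, io.IOBase]]):
    # Yields (path, file) tuples with the file handle transparently
    # decompressed based on the path suffix.
    def __init__(self, datapipe: IterDataPipe[Tuple[str, io.IOBase]]) -> None:
        self.datapipe = datapipe

    def __iter__(self) -> Iterator[Tuple[str, io.IOBase]]:
        for path, file in self.datapipe:
            if path.endswith(".gz"):
                yield path, gzip.GzipFile(fileobj=file)
            elif path.endswith((".xz", ".lzma")):
                yield path, lzma.LZMAFile(file)
            else:
                # pass unrecognized files through unchanged
                yield path, file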

Comment on lines 389 to 396
if config.split == "test10k":
start = 0
stop = 10000
else: # config.split == "test50k"
start = 10000
stop = None

return Slicer(dp, start=start, stop=stop)

@ejguan commented:
Is there a way we can untangle test10k and test50k at the resource stage?

@pmeier (author) replied:

Unfortunately, no. The earliest point would be in the MNISTFileReader, which could enumerate the chunks and discard the unused ones directly. In fact, that would improve performance, since otherwise we also decode the chunks that we throw away later. I'll send a patch.
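
To make the idea concrete, here is a rough sketch of binary-level skipping for the MNIST IDX image format. It is a simplification that assumes uint8 image data; the PR's MNISTFileReader handles arbitrary dtypes and shapes:

import struct
from typing import BinaryIO, Iterator, Optional


def iter_mnist_image_chunks(
    file: BinaryIO, start: int = 0, stop: Optional[int] = None
) -> Iterator[bytes]:
    # IDX header: big-endian magic number, number of images, rows, columns
    _, num_images, rows, cols = struct.unpack(">4i", file.read(16))
    chunk_size = rows * cols  # one uint8 per pixel
    stop = num_images if stop is None else min(stop, num_images)

    # skip the first `start` images at the binary level instead of
    # decoding them and throwing the result away afterwards
    file.seek(start * chunk_size, 1)
    for _ in range(start, stop):
        yield file.read(chunk_size)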

@pmeier (author) commented:

The latest commits include this change.

@pmeier pmeier requested a review from ejguan October 6, 2021 08:31
@ejguan ejguan left a comment


LGTM, with one comment about forward compatibility (FC) for web streams.

start = self.start or 0
stop = self.stop or num_samples

file.seek(start * chunk_size, 1)

@ejguan commented:

Could we also do a try-except here? For a web stream like HTTPResponse, which is not seekable, we would have to use read(bytes) to get past it instead.
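
A sketch of what such a fallback could look like, with a hypothetical skip_bytes helper that is not part of the PR:

import io


def skip_bytes(file: io.IOBase, num_bytes: int) -> None:
    try:
        # fast path: seek relative to the current position
        file.seek(num_bytes, 1)
    except (io.UnsupportedOperation, OSError):
        # non-seekable streams such as http.client.HTTPResponse:
        # read and discard the bytes instead
        while num_bytes > 0:
            chunk = file.read(min(num_bytes, 1 << 20))
            if not chunk:
                break  # stream exhausted early
            num_bytes -= len(chunk)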

@pmeier (author) replied:

I don't see the point of it. This datapipe is specifically for MNIST files, which will always be seekable. What scenario do you have in mind where we would get the data from an HTTPResponse?

@ejguan replied:

I was wondering if you want to support streaming the dataset from a remote source in the future.

@pmeier (author) replied:

Let's talk about that in our sync.

@pmeier pmeier merged commit 261cbf7 into pytorch:main Oct 7, 2021
@pmeier pmeier deleted the datasets/mnist branch October 7, 2021 09:28
facebook-github-bot pushed a commit that referenced this pull request Oct 8, 2021
Summary:
* add prototype datasets for MNIST and variants

* fix mypy

* fix EMNIST labels

* fix code format

* avoid encoding + decoding in every step

* discard data at the binary level instead of after decoding

* cleanup

* fix mypy

Reviewed By: NicolasHug

Differential Revision: D31505561

fbshipit-source-id: 7ac988ec660c9761edf51280640f318076e0ba75
mszhanyi pushed a commit to mszhanyi/vision that referenced this pull request Oct 19, 2021 (same commit message as above)
cyyever pushed a commit to cyyever/vision that referenced this pull request Nov 16, 2021 (same commit message as above)