This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@erip (Contributor) commented Jan 28, 2022

Reference #1494

This is a bit messy, but seems kosher. cc @ejguan for review

@erip (Contributor, Author) commented Jan 28, 2022

I plan to simplify some of the logic here tomorrow.

@erip erip force-pushed the feature/migrate-IWSLT2016-datapipes branch 2 times, most recently from 715fcde to ab7b7e9 on January 28, 2022 00:58
@erip (Contributor, Author) commented Jan 28, 2022

OK, this is significantly cleaner now. I think this is ready for review!

@parmeet (Contributor) left a comment

Overall it LGTM! It's just not easy to verify correctness while the CI unit tests for datasets are failing due to a cache failure (I am looking into why the cache is failing). Could you please make sure it works locally as well?

@erip (Contributor, Author) commented Jan 28, 2022

Hmm, the tests are passing, but exercising the dataset in the REPL breaks. I'll fix this and push...

@erip (Contributor, Author) commented Jan 28, 2022

I'm getting a bit mixed up with the logic so I'm just going to brain dump here...

  • We check the cache for the outermost tgz. If we don't have it, we fetch it.

  • Next we read from the outermost tgz. In the original code we just extract the whole thing, so we want to wrap a read_from_tar call in caching. The filepath_fn here is not obvious to me...

  • Within the extraction, there's a tarfile created by iwslt_tar = os.path.join(root, os.path.splitext(_PATH)[0], "texts", src_language, tgt_language, languages) + ".tgz" which has contents we care about. In the original code this is extracted unconditionally. The filepath_fn is again not obvious here...

  • Finally, we iterate through the files in this inner tgz and clean/read as needed. This logic after this is fine, I think...
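The cache-check idea behind these steps can be sketched in plain Python with `tarfile`, without the datapipes API (the `cached` helper and file names like `en-de.tgz` are illustrative assumptions, not the PR's actual code):

```python
import os
import tarfile
import tempfile

def cached(path, produce):
    # Emulates an on-disk cache: run `produce` only when `path` is missing.
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        produce(path)
    return path

root = tempfile.mkdtemp()

# Build a fake outer archive containing an inner en-de.tgz
# (a stand-in for the real IWSLT download).
inner_src = os.path.join(root, "src.txt")
with open(inner_src, "w") as f:
    f.write("hello\n")
inner_tgz = os.path.join(root, "en-de.tgz")
with tarfile.open(inner_tgz, "w:gz") as t:
    t.add(inner_src, arcname="train.en-de.en")
outer_tgz = os.path.join(root, "2016-01.tgz")
with tarfile.open(outer_tgz, "w:gz") as t:
    t.add(inner_tgz, arcname="texts/en/de/en-de.tgz")

# Steps 2-3: extract the inner tgz from the outer archive only if it is
# not already on disk -- here, filepath_fn would point at the inner archive.
cache_dir = os.path.join(root, "cache")
inner_path = os.path.join(cache_dir, "texts", "en", "de", "en-de.tgz")

def extract_inner(path):
    with tarfile.open(outer_tgz) as t:
        t.extract("texts/en/de/en-de.tgz", cache_dir)

cached(inner_path, extract_inner)

# Step 4: iterate the files we care about inside the inner tgz.
with tarfile.open(inner_path) as t:
    names = t.getnames()
```

A second `cached(inner_path, extract_inner)` call would find the file on disk and skip the extraction entirely, which is the behavior the `filepath_fn` is meant to enable.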

@Nayef211 (Contributor) left a comment

Thanks for taking on this migration @erip. This dataset is definitely more complex than the others we've worked on so far. I've left a couple of comments from my side, but lmk if you agree with the suggestions 😄

@parmeet (Contributor) commented Jan 28, 2022

I'm getting a bit mixed up with the logic so I'm just going to brain dump here...

Thanks @erip for this exercise. This is certainly not as straightforward as the other datasets. My understanding of filepath_fn in the context of caching is that the cache holder checks whether the file paths returned by that function already exist on the filesystem, and skips the upstream work if they do (here is the code: https://github.com/pytorch/data/blob/160ce809f8c3ee686af0e3e701d7341827841eeb/torchdata/datapipes/iter/util/cacheholder.py#L182)

Based on this, I would like to think about the steps as follows:

  1. Download the outer tar (with the usual caching mechanism)
  2. Extract the outer tar
    2.1) We would do this extraction only if the inner tar we care about (say, en-de.tgz) doesn't exist on the filesystem, so filepath_fn would return that path
  3. Filter the files from step 2 so that we only get en-de.tgz
    3.1) Apply the same caching logic
  4. Filter the files we care about and apply the clean-up function
    4.1) Again, we check via the caching mechanism whether the cleaned-up file exists on the filesystem, and apply the clean-up function only if it doesn't

So in a way there are 4 levels of caching: 1) download, 2) outer extraction, 3) inner extraction, 4) clean-up. Caching gets quite involved after level 1, so I would suggest breaking this PR in two: in this PR we only introduce caching for the download as usual, then introduce the caching mechanism for the other parts in a follow-up PR. That would also make it a bit easier to review. Let me know if this helps.
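The four caching levels can be emulated with a tiny stand-in for the on-disk cache, which also shows that a second pass through the pipeline re-runs nothing (stage names and file paths below are hypothetical, not the dataset's real layout):

```python
import os
import tempfile

calls = []  # records which stages actually ran

def cache_stage(name, path, produce):
    # One caching level: skip `produce` when `path` already exists on disk.
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        produce(path)
        calls.append(name)
    return path

root = tempfile.mkdtemp()

def pipeline():
    # Level 1: download the outer archive.
    cache_stage("download", os.path.join(root, "2016-01.tgz"),
                lambda p: open(p, "w").write("outer"))
    # Level 2: extract the inner archive from the outer one.
    cache_stage("outer-extract", os.path.join(root, "en-de.tgz"),
                lambda p: open(p, "w").write("inner"))
    # Level 3: extract the raw file from the inner archive.
    cache_stage("inner-extract", os.path.join(root, "train.en-de.en"),
                lambda p: open(p, "w").write("<seg>hello</seg>"))
    # Level 4: write the cleaned-up file.
    return cache_stage("clean-up", os.path.join(root, "train.clean.en"),
                       lambda p: open(p, "w").write("hello"))

pipeline()
first = list(calls)
pipeline()                     # every level is cached now; nothing re-runs
second = calls[len(first):]
```

The first call runs all four stages in order; the second finds every path on disk and does no work, which is the disk-space and speed win the caching is after.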

cc: @ejguan, to make sure I haven't misguided anyone on the caching story :)

@erip (Contributor, Author) commented Jan 28, 2022

@parmeet that's helpful. I have things nearly working and am on the cleaning step in the last caching block. The issue currently is that when clean_files gets the filename, it's the name of the file within the tar (with its associated stream also available). This requires some ugly path-hacking to create the corresponding on-disk file in the map, to emulate the side effect of the write in the original code, but it should be doable.
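The "path-hacking" might look something like this sketch, which maps a member name inside the tar to an on-disk destination for the cleaned file (the member name and the helper are illustrative assumptions, not the PR's code):

```python
import os

def inner_to_clean_path(root, tar_member):
    # The in-tar name carries the archive's internal directory structure,
    # e.g. "texts/en/de/en-de/train.tags.en-de.en"; only the basename is
    # meaningful for the on-disk cleaned file.
    return os.path.join(root, os.path.basename(tar_member))

p = inner_to_clean_path("/data/iwslt2016", "texts/en/de/en-de/train.tags.en-de.en")
```

The cleaning map can then open `p` for writing while consuming the member's stream, reproducing the original code's write side effect.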

@erip (Contributor, Author) commented Jan 28, 2022

OK, this looks better now.


@erip erip force-pushed the feature/migrate-IWSLT2016-datapipes branch from f689558 to 9381a88 on January 28, 2022 20:36
@erip erip requested review from Nayef211, ejguan and parmeet January 28, 2022 20:41
@erip erip force-pushed the feature/migrate-IWSLT2016-datapipes branch from a88a99e to 60f0b80 on January 29, 2022 21:10
@erip (Contributor, Author) commented Jan 29, 2022

OK, I'm happier with this. 😸

@parmeet (Contributor) commented Jan 29, 2022

OK, I'm happier with this. 😸

Wow, this looks great. Thanks so much @erip for making sense of my half-baked comments and working so hard on this PR :). This is certainly the most challenging migration so far. I think we are good to go with what we have now!

Edit: the speed-ups look great, and we now extract only what's necessary, saving disk space!!

@parmeet (Contributor) left a comment

LGTM! Let's merge it once we address the minor comment here #1545 (comment)

@erip (Contributor, Author) commented Jan 30, 2022

Done, @parmeet! :shipit:

@parmeet parmeet merged commit c10d7ef into pytorch:main Jan 30, 2022
@erip erip deleted the feature/migrate-IWSLT2016-datapipes branch January 30, 2022 12:39