This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@erip (Contributor) commented Jan 28, 2022

Reference #1494

This is a bit messy, but seems kosher. cc @ejguan for review

@erip (Contributor, Author) commented Jan 28, 2022

I plan to simplify some of the logic here tomorrow.

@erip erip force-pushed the feature/migrate-IWSLT2016-datapipes branch 2 times, most recently from 715fcde to ab7b7e9 on January 28, 2022 00:58
@erip (Contributor, Author) commented Jan 28, 2022

OK, this is significantly cleaner now. I think this is ready for review!

@parmeet (Contributor) left a comment

Overall it LGTM! It's just not easy to verify correctness while the CI unit tests for datasets are failing due to a cache failure (I am looking into why the cache is failing). Could you please make sure it works locally as well?

@erip (Contributor, Author) commented Jan 28, 2022

Hmm, the tests are passing, but exercising the dataset in the REPL breaks. I'll fix this and push...

@erip (Contributor, Author) commented Jan 28, 2022

I'm getting a bit mixed up with the logic so I'm just going to brain dump here...

  • We check the cache for the outermost tgz. If we don't have it, we fetch it.

  • Next we read from the outermost tgz. In the original code we just extract the whole thing, so we want to wrap a read_from_tar call in caching. The filepath_fn here is not obvious to me...

  • Within the extraction, there's a tarfile created by iwslt_tar = os.path.join(root, os.path.splitext(_PATH)[0], "texts", src_language, tgt_language, languages) + ".tgz" which has contents we care about. In the original code this is extracted unconditionally. The filepath_fn is again not obvious here...

  • Finally, we iterate through the files in this inner tgz and clean/read as needed. This logic after this is fine, I think...
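The cache-check idea behind these steps can be sketched in plain Python with `tarfile`, without the datapipes API (the `cached` helper and file names like `en-de.tgz` are illustrative assumptions, not the PR's actual code):

```python
import os
import tarfile
import tempfile

def cached(path, produce):
    # Emulates an on-disk cache: run `produce` only when `path` is missing.
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        produce(path)
    return path

root = tempfile.mkdtemp()

# Build a fake outer archive containing an inner en-de.tgz
# (a stand-in for the real IWSLT download).
inner_src = os.path.join(root, "src.txt")
with open(inner_src, "w") as f:
    f.write("hello\n")
inner_tgz = os.path.join(root, "en-de.tgz")
with tarfile.open(inner_tgz, "w:gz") as t:
    t.add(inner_src, arcname="train.en-de.en")
outer_tgz = os.path.join(root, "2016-01.tgz")
with tarfile.open(outer_tgz, "w:gz") as t:
    t.add(inner_tgz, arcname="texts/en/de/en-de.tgz")

# Steps 2-3: extract the inner tgz from the outer archive only if it is
# not already on disk -- here, filepath_fn would point at the inner archive.
cache_dir = os.path.join(root, "cache")
inner_path = os.path.join(cache_dir, "texts", "en", "de", "en-de.tgz")

def extract_inner(path):
    with tarfile.open(outer_tgz) as t:
        t.extract("texts/en/de/en-de.tgz", cache_dir)

cached(inner_path, extract_inner)

# Step 4: iterate the files we care about inside the inner tgz.
with tarfile.open(inner_path) as t:
    names = t.getnames()
```

A second `cached(inner_path, extract_inner)` call would find the file on disk and skip the extraction entirely, which is the behavior the `filepath_fn` is meant to enable.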

@Nayef211 (Contributor) left a comment

Thanks for taking on this migration @erip. This dataset is definitely more complex than the others we've worked on so far. I've left a couple of comments from my side, but lmk if you agree with the suggestions 😄

@parmeet (Contributor) commented Jan 28, 2022

I'm getting a bit mixed up with the logic so I'm just going to brain dump here...

Thanks @erip for this exercise. This is certainly not as straightforward as the other datasets. My understanding of filepath_fn in the context of caching is that the cache holder checks whether the file paths returned by that function already exist on the filesystem, and skips the upstream work if they do (here is the code: https://github.com/pytorch/data/blob/160ce809f8c3ee686af0e3e701d7341827841eeb/torchdata/datapipes/iter/util/cacheholder.py#L182)

Based on this, I would like to think about the steps as follows:

  1. Download the outer tar (with the usual caching mechanism)
  2. Extract the outer tar
    2.1) We would do this extraction only if the inner tar we care about (say, en-de.tgz) doesn't exist on the filesystem, so filepath_fn would return that path
  3. Filter the files from step 2 so that we only get en-de.tgz
    3.1) Apply the same caching logic
  4. Filter the files we care about and apply the clean-up function
    4.1) Again, we check via the caching mechanism whether the cleaned-up file exists on the filesystem, and apply the clean-up function only if it doesn't

So in a way there are 4 levels of caching: 1) download, 2) outer extraction, 3) inner extraction, 4) clean-up. Caching gets quite involved after level 1, so I would suggest breaking this PR in two: in this PR we only introduce caching for the download as usual, then introduce the caching mechanism for the other parts in a follow-up PR. That would also make it a bit easier to review. Let me know if this helps.
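The four caching levels can be emulated with a tiny stand-in for the on-disk cache, which also shows that a second pass through the pipeline re-runs nothing (stage names and file paths below are hypothetical, not the dataset's real layout):

```python
import os
import tempfile

calls = []  # records which stages actually ran

def cache_stage(name, path, produce):
    # One caching level: skip `produce` when `path` already exists on disk.
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        produce(path)
        calls.append(name)
    return path

root = tempfile.mkdtemp()

def pipeline():
    # Level 1: download the outer archive.
    cache_stage("download", os.path.join(root, "2016-01.tgz"),
                lambda p: open(p, "w").write("outer"))
    # Level 2: extract the inner archive from the outer one.
    cache_stage("outer-extract", os.path.join(root, "en-de.tgz"),
                lambda p: open(p, "w").write("inner"))
    # Level 3: extract the raw file from the inner archive.
    cache_stage("inner-extract", os.path.join(root, "train.en-de.en"),
                lambda p: open(p, "w").write("<seg>hello</seg>"))
    # Level 4: write the cleaned-up file.
    return cache_stage("clean-up", os.path.join(root, "train.clean.en"),
                       lambda p: open(p, "w").write("hello"))

pipeline()
first = list(calls)
pipeline()                     # every level is cached now; nothing re-runs
second = calls[len(first):]
```

The first call runs all four stages in order; the second finds every path on disk and does no work, which is the disk-space and speed win the caching is after.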

cc: @ejguan, to make sure I haven't misguided anyone on the caching story :)

@erip (Contributor, Author) commented Jan 28, 2022

@parmeet that's helpful. I have things nearly working and am on the cleaning step in the last caching block. The issue currently is that when clean_files gets the filename, it's the name of the file within the tar (with its associated stream also available). This requires some ugly path-hacking to create the corresponding on-disk file in the map, to emulate the side effect of the write in the original code, but it should be doable.
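The "path-hacking" might look something like this sketch, which maps a member name inside the tar to an on-disk destination for the cleaned file (the member name and the helper are illustrative assumptions, not the PR's code):

```python
import os

def inner_to_clean_path(root, tar_member):
    # The in-tar name carries the archive's internal directory structure,
    # e.g. "texts/en/de/en-de/train.tags.en-de.en"; only the basename is
    # meaningful for the on-disk cleaned file.
    return os.path.join(root, os.path.basename(tar_member))

p = inner_to_clean_path("/data/iwslt2016", "texts/en/de/en-de/train.tags.en-de.en")
```

The cleaning map can then open `p` for writing while consuming the member's stream, reproducing the original code's write side effect.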

@erip (Contributor, Author) commented Jan 28, 2022

OK, this looks better now.


@erip erip force-pushed the feature/migrate-IWSLT2016-datapipes branch from f689558 to 9381a88 on January 28, 2022 20:36
@erip erip requested review from Nayef211, ejguan and parmeet January 28, 2022 20:41
@erip erip force-pushed the feature/migrate-IWSLT2016-datapipes branch from a88a99e to 60f0b80 on January 29, 2022 21:10
@erip (Contributor, Author) commented Jan 29, 2022

OK, I'm happier with this. 😸

@parmeet (Contributor) commented Jan 29, 2022

OK, I'm happier with this. 😸

Wow, this looks great. Thanks so much @erip for making sense of my half-baked comments and working so hard on this PR :). This is certainly the most challenging migration so far. I think we are good to go with what we have now!

Edit: the speed-ups look great, and we now extract only what's necessary, saving disk space!!

@parmeet (Contributor) left a comment

LGTM! Let's merge it once we address the minor comment here #1545 (comment)

@erip (Contributor, Author) commented Jan 30, 2022

Done, @parmeet! :shipit:

@parmeet parmeet merged commit c10d7ef into pytorch:main Jan 30, 2022
@erip erip deleted the feature/migrate-IWSLT2016-datapipes branch January 30, 2022 12:39