Skip to content
This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@erip
Copy link
Contributor

@erip erip commented Feb 5, 2022

This serves as a patch to the newly-added IWSLT2016 mock testing which addresses two issues:

  1. Starting from the downloaded archive to test the extraction and cleaning pipeline more fully
  2. Adds missed test split testing path

cc @parmeet

super().tearDownClass()

@parameterized.expand([("train", "de", "en"), ("valid", "de", "en")])
@parameterized.expand([
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: can we (should we?) parameterize by each supported lang pair?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For complete code coverage I think it would make sense to parameterize by each supported lang pair. I believe we could use the nested_params fn to generate the cartesian product of language pairs. The main issue I forsee with using that function is it would generate lang pairs that map to the same language (i.e en ->en) which I'm not sure would make sense for this dataset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@parmeet also wanted to get your thoughts on this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would agree. Simulating the exact structure as the original dataset with all the possible use-cases would be the key for coverage.

Copy link
Contributor

@Nayef211 Nayef211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think overall this PR LGTM. @parmeet can I also get your review since you reviewed the initial IWSLT PR #1563

super().tearDownClass()

@parameterized.expand([("train", "de", "en"), ("valid", "de", "en")])
@parameterized.expand([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For complete code coverage I think it would make sense to parameterize by each supported lang pair. I believe we could use the nested_params fn to generate the cartesian product of language pairs. The main issue I forsee with using that function is it would generate lang pairs that map to the same language (i.e en ->en) which I'm not sure would make sense for this dataset.

@erip erip force-pushed the hotfix/iwslt2016-test branch from 7be1b68 to 417ee55 Compare February 7, 2022 16:30
@parmeet
Copy link
Contributor

parmeet commented Feb 7, 2022

There are test failures related to this test. @erip could you please fix them and then I can follow-up with the review. Thanks!

@erip
Copy link
Contributor Author

erip commented Feb 7, 2022

It looks like my attempt at addressing the review feedback has broken the tests and unfortunately I can't test locally because the required nightly isn't available on OS X... I'll try to address this ASAP.

@Nayef211
Copy link
Contributor

Nayef211 commented Feb 7, 2022

It looks like my attempt at addressing the review feedback has broken the tests and unfortunately I can't test locally because the required nightly isn't available on OS X... I'll try to address this ASAP.

Thanks for all your help with the testing efforts @erip! I just want to see if you would have the bandwidth to add the IWSLT2017 test as well once this PR is merged? It should be a simple matter of parameterizing this test to make it work with both IWSLT2016 and IWSLT2017.

The torchtext branch-cut is coming up this Thursday and we want to make sure all our tests are complete by this Wednesay!

@erip
Copy link
Contributor Author

erip commented Feb 7, 2022

Yes, absolutely!

@erip
Copy link
Contributor Author

erip commented Feb 7, 2022

to generate the cartesian product of language pairs

IWSLT16 doesn't support every langpair direction, so we would need to filter some pairs out. At that point, it might just make more sense to enumerate the pairs directly (with some logic for swapping the direction as needed). Thoughts?

@parmeet
Copy link
Contributor

parmeet commented Feb 7, 2022

IWSLT16 doesn't support every langpair direction, so we would need to filter some pairs out. At that point, it might just make more sense to enumerate the pairs directly (with some logic for swapping the direction as needed). Thoughts?

yupp, I agree. The total number of supported pairs are reasonable enough to enumerate explicitly.

@erip
Copy link
Contributor Author

erip commented Feb 7, 2022

@parmeet the middle ground seems to be leveraging the SUPPORTED_DATASETS attribute which gives the adjacency matrix for langpairs. I've incorporated that into testing but as I mention, the turnaround is a bit slow since I can't test locally ☹️

@Nayef211
Copy link
Contributor

Nayef211 commented Feb 8, 2022

I can't test locally because the required nightly isn't available on OS X... I'll try to address this ASAP.

@erip just wondering if you're talking about the torchtext nightly? Because we can build any of the pytorch/torchtext packages directly from source right? Also I'm happy to patch this PR to test your changes in a linux environment and give you feedback on any failures if you think that would be helpful!

@erip
Copy link
Contributor Author

erip commented Feb 8, 2022

No, it's pytorch nightly. I'm running into an error about datapipes_only not being a valid kwarg -- the same one that was causing CI errors in Windows. I have been trying to install fresh nightlies for a couple of days but no joy.

@erip
Copy link
Contributor Author

erip commented Feb 8, 2022

OK, a new nightly has been cut so I can test locally again 🥳 I have fixed one lingering issue and have a TODO to pass the right test sets... will follow a similar approach as langpair parameterization.

@erip
Copy link
Contributor Author

erip commented Feb 8, 2022

OK @parmeet @Nayef211 this is ready for review. The one thing I couldn't "figure out" was the "temp_dataset_dir" business. I think maybe that won't work here since we're looking for a very specific internal structure, but if you have thoughts about how to approach it I can try to add it.

@erip erip force-pushed the hotfix/iwslt2016-test branch from 5242134 to 91e2cf2 Compare February 8, 2022 16:00
@erip
Copy link
Contributor Author

erip commented Feb 8, 2022

I've rebased so the previously broken Multi30k should be good here. Tests are passing and life is good. 😎

@codecov
Copy link

codecov bot commented Feb 8, 2022

Codecov Report

Merging #1585 (91e2cf2) into main (da34de2) will increase coverage by 0.62%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1585      +/-   ##
==========================================
+ Coverage   80.34%   80.96%   +0.62%     
==========================================
  Files          58       58              
  Lines        2569     2569              
==========================================
+ Hits         2064     2080      +16     
+ Misses        505      489      -16     
Impacted Files Coverage Δ
torchtext/data/datasets_utils.py 64.74% <0.00%> (+5.75%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da34de2...91e2cf2. Read the comment docs.

root_dir, "IWSLT2016", "2016-01.tgz"
)
with tarfile.open(outer_temp_dataset_path, "w:gz") as tar:
tar.add(outer_temp_dataset_dir, arcname="2016-01")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before adding outer_temp_dataset_dir to the archive, we probably want to remove inner_temp_dataset_dir since the contents of this folder are already available in the inner_compressed_dataset_path, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could approach it that way -- my approach is to just re-read the cleaned files to generate the expected values. I don't know if there's a strong reason to prefer one vs. the other.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, we probably want to match the extraction process as if we are working with the original dataset archive. The code-base would be extracting and caching files from inner tarball sayde-en.tgz and if the files are already present in top-most archive 2016-01.tgz then we won't be executing the inner extraction (due to secondary caching logic introduced in this issue #1589). This is certainly a complex test/dataset, let me know if I am missing anything :)

Copy link
Contributor

@Nayef211 Nayef211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just left some errors I found while patching your PR locally and removing the outer temp dir. Would you be able to investigate further @erip?

"""
temp_dataset_dir = os.path.join(root_dir, f"IWSLT2016/2016-01/texts/{src}/{tgt}/{src}-{tgt}/")
os.makedirs(temp_dataset_dir, exist_ok=True)
outer_temp_dataset_dir = os.path.join(root_dir, f"IWSLT2016/2016-01/texts/{src}/{tgt}/")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change this path to be "...IWSLT2016/temp_dataset_dir/2016-01/..." you will notice that there are a lot of test failures. The purpose of naming it this is to ensure that the folder we are creating the mocked data is different from where the files/folders would be place when the archive is extracted (i.e. when calling the dataset). Without doing this, the archive will never be extracted since our dataset implementation checks to see whether a file/folder exists and if it does, the archive is never extracted.

When patching this PR locally and making the above change, I noticed we're getting 212 failed, 451 passed. This is a sample error I'm getting from pytest:

FAILED test_iwslt2016.py::TestIWSLT2016::test_iwslt2016_219_train - AssertionError: 'H T G W w U w K b d\n' != 'b H K H c L t k I s\n'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I understand the idea now. 👍 Let me give this a shot.

)
with tarfile.open(outer_temp_dataset_path, "w:gz") as tar:
tar.add(outer_temp_dataset_dir, arcname="2016-01")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative to what I mentioned above is to add the following lines here to delete the outer temp dataset dir and in doing so ensure the archive will always be extracted:

import shutil
shutil.rmtree(outer_temp_dataset_dir)

Doing this gives me the following results: 110 failed, 553 passed. This is a sample error I'm getting from pytest:

FAILED test_iwslt2016.py::TestIWSLT2016::test_iwslt2016_008_train - ValueError: Iterables have different lengths

@erip
Copy link
Contributor Author

erip commented Feb 8, 2022

I'm having some trouble wrapping my mind around all the various nestings again and debugging all the issues accompanying it. I also have some personal things to attend to -- I understand there's some urgency here, so I'm not going to be offended if someone else wants to work on this. I can try to look again tonight.

@parmeet
Copy link
Contributor

parmeet commented Feb 8, 2022

I'm having some trouble wrapping my mind around all the various nestings again and debugging all the issues accompanying it. I also have some personal things to attend to -- I understand there's some urgency here, so I'm not going to be offended if someone else wants to work on this. I can try to look again tonight.

Hey @erip, thanks so much for your efforts in this PR so far. Please do not bother. I know this is a complex test and since this is an improvement over what you have already contributed here #1563, I don't necessarily call it a blocker for our branch-cut. I would be happy to take it over from here, otherwise we can keep this PR and do a cherry-picking into the Release branch later. cc: @Nayef211

@Nayef211
Copy link
Contributor

Nayef211 commented Feb 9, 2022

Hey @erip, thanks so much for your efforts in this PR so far. Please do not bother. I know this is a complex test and since this is an improvement over what you have already contributed here #1563, I don't necessarily call it a blocker for our branch-cut. I would be happy to take it over from here, otherwise we can keep this PR and do a cherry-picking into the Release branch later. cc: @Nayef211

@parmeet if you have the bandwidth to take this on that would be great! We just have the test for the IWSLT2016 and IWSLT2017 datasets remaining before we can close #1493. I suspect the current caching issue should be an easy fix if you have a good understanding of the folder structure within the dataset archive.

@parmeet
Copy link
Contributor

parmeet commented Feb 9, 2022

Hey @erip, thanks so much for your efforts in this PR so far. Please do not bother. I know this is a complex test and since this is an improvement over what you have already contributed here #1563, I don't necessarily call it a blocker for our branch-cut. I would be happy to take it over from here, otherwise we can keep this PR and do a cherry-picking into the Release branch later. cc: @Nayef211

@parmeet if you have the bandwidth to take this on that would be great! We just have the test for the IWSLT2016 and IWSLT2017 datasets remaining before we can close #1493. I suspect the current caching issue should be an easy fix if you have a good understanding of the folder structure within the dataset archive.

sure, let me give it a try.

@Nayef211
Copy link
Contributor

Closing this now that #1596 is merged

@Nayef211 Nayef211 closed this Feb 10, 2022
@erip erip deleted the hotfix/iwslt2016-test branch February 10, 2022 20:08
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants