mock up IWSLT2016 test for faster testing. #1563
Conversation
```python
expected_samples = _get_mock_dataset(self.root_dir, split, src, tgt)
dataset = IWSLT2016(root=self.root_dir, split=split)
samples = list(dataset)
for sample, expected_sample in zip_equal(samples, expected_samples):
```
NIT: can we organize `samples` and `expected_samples` similar to what we do in the SST2 PR, for consistency?
Unfortunately it's not quite as straightforward unless we want to hardcode the language pairs in setUpClass. Otherwise there's no good way to parameterize them. This is required because the expected file name for caching is a function of the (src_lang, tgt_lang, split) tuple, which is somewhat unique to this dataset.
One thing we could do is make samples a function which accepts *args and each dataset could handle them the same way using this pattern... The order matters here though because _get_mock_dataset needs to create the temp dir before IWSLT2016 reads it.
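A minimal sketch of the pattern being discussed, with hypothetical names (`get_mock_samples`, `read_dataset` stand in for `_get_mock_dataset` and `IWSLT2016`): the mock-generating function accepts the parameterized `*args` and writes the files, and it must run before the dataset reads the directory.

```python
import os
import tempfile

# Hypothetical sketch of the "*args" pattern: the function that returns the
# expected samples is also the one that writes the mock files, so it must run
# before the "dataset" reads the directory. Names are illustrative only and
# are not torchtext APIs.

def get_mock_samples(root_dir, *args):
    split, src, tgt = args
    path = os.path.join(root_dir, f"{src}-{tgt}.{split}.txt")
    samples = [(f"{src} line {i}", f"{tgt} line {i}") for i in range(3)]
    with open(path, "w") as f:
        for s, t in samples:
            f.write(f"{s}\t{t}\n")
    return samples

def read_dataset(root_dir, *args):
    split, src, tgt = args
    path = os.path.join(root_dir, f"{src}-{tgt}.{split}.txt")
    with open(path) as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f]

with tempfile.TemporaryDirectory() as root:
    args = ("train", "de", "en")
    expected = get_mock_samples(root, *args)  # must run first: creates the files
    actual = read_dataset(root, *args)        # only then can the dataset read them
    assert actual == expected
```

This is why the ordering constraint rules out a naive shared `setUpClass`: the language pair is only known once the parameterized test body runs.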
> Unfortunately it's not quite as straightforward unless we want to hardcode the language pairs in setUpClass. Otherwise there's no good way to parameterize them. This is required because the expected file name for caching is a function of the (src_lang, tgt_lang, split) which is somewhat unique.

Gotcha, I think it's okay to keep your current implementation. I didn't catch the fact that the ordering mattered here.
```python
@parameterized.expand([("train", "de", "en"), ("valid", "de", "en")])
def test_iwslt2016(self, split, src, tgt):
    expected_samples = _get_mock_dataset(self.root_dir, split, src, tgt)
```
Is there any specific reason why we don't want to generate all the mocked data within the setUpClass method and store it in self.samples like we do in the other tests?
Nayef211 left a comment
Overall LGTM. I think we can merge this once we fix the nit comment I left about variable renaming!
Done, @Nayef211!
```python
        cls.patcher.stop()
        super().tearDownClass()

    @parameterized.expand([("train", "de", "en"), ("valid", "de", "en")])
```
IWSLT2016 also consists of a test split, so ideally we should include it in testing as well.
Oops, yes. I had made a change and forgot to re-incorporate the test split. I can cut a PR to fix this in the morning.
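A hedged sketch of the proposed follow-up: the parameter list gains a `test` row alongside the two shown in the diff above (the tuple layout `(split, src, tgt)` is taken from that decorator; the variable name `SPLIT_PARAMS` is illustrative).

```python
# Proposed parameterization including the missing "test" split. The tuple
# layout (split, src, tgt) matches the @parameterized.expand call in the
# diff above; SPLIT_PARAMS is a hypothetical name for illustration.
SPLIT_PARAMS = [
    ("train", "de", "en"),
    ("valid", "de", "en"),
    ("test", "de", "en"),  # the split that was dropped by accident
]
# would be used as: @parameterized.expand(SPLIT_PARAMS)
assert {split for split, _, _ in SPLIT_PARAMS} == {"train", "valid", "test"}
```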
```python
import os
import random
import string
from collections import defaultdict


def _get_mock_dataset(root_dir, split, src, tgt):
    """
    root_dir: directory to the mocked dataset
    """
    temp_dataset_dir = os.path.join(root_dir, f"IWSLT2016/2016-01/texts/{src}/{tgt}/{src}-{tgt}/")
    os.makedirs(temp_dataset_dir, exist_ok=True)

    seed = 1
    mocked_data = defaultdict(lambda: defaultdict(list))
    valid_set = "tst2013"
    test_set = "tst2014"

    files_for_split, _ = _generate_iwslt_files_for_lang_and_split(16, src, tgt, valid_set, test_set)
    src_file = files_for_split[src][split]
    tgt_file = files_for_split[tgt][split]
    for file_name in (src_file, tgt_file):
        txt_file = os.path.join(temp_dataset_dir, file_name)
        with open(txt_file, "w") as f:
            # Get file extension (i.e., the language) without the . prefix (.en -> en)
            lang = os.path.splitext(file_name)[1][1:]
            for _ in range(5):
                rand_string = " ".join(
                    random.choice(string.ascii_letters) for _ in range(seed)
                )
                dataset_line = f"{rand_string} {rand_string}\n"
                # append line to correct dataset split and write the same line to disk
                mocked_data[split][lang].append(dataset_line)
                f.write(dataset_line)
                seed += 1

    return list(zip(mocked_data[split][src], mocked_data[split][tgt]))
```
Is there a reason we are not creating a download archive 2016-01.tgz like we are doing for other datasets?
Edit: I think it's quite important to start from the download archive; otherwise we can get into hard-to-find bugs, especially when the compression pattern is as complex as it is here.
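A hedged sketch of what "start from the download archive" could look like: build a mock 2016-01.tgz with the stdlib `tarfile` module so the test also exercises extraction. The inner directory layout here mirrors the paths used in `_get_mock_dataset`; the exact file names inside the real archive are an assumption.

```python
import os
import tarfile
import tempfile

# Sketch only: create a nested mock tree, pack it into 2016-01.tgz, then
# verify that extraction reproduces the nested file. The inner layout
# (texts/de/en/de-en/...) follows the paths used elsewhere in this test;
# the exact file names are assumptions, not confirmed from the real archive.
with tempfile.TemporaryDirectory() as root:
    inner = os.path.join(root, "2016-01", "texts", "de", "en", "de-en")
    os.makedirs(inner)
    with open(os.path.join(inner, "train.tags.de-en.de"), "w") as f:
        f.write("hallo welt\n")

    archive = os.path.join(root, "2016-01.tgz")
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps the archive rooted at 2016-01/ like the real download
        tar.add(os.path.join(root, "2016-01"), arcname="2016-01")

    out = os.path.join(root, "out")
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out)
    extracted = os.path.join(out, "2016-01", "texts", "de", "en", "de-en", "train.tags.de-en.de")
    assert os.path.exists(extracted)
```

Starting the mock from the archive means a bug in the decompression path (nested tars, wrong arcnames) fails the test instead of being silently bypassed.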
> Is there a reason we are not creating a download archive 2016-01.tgz like we are doing for other datasets? Edit: I think it's quite important to start from the download archive, otherwise we can get into hard-to-find bugs, especially when the compression pattern is complex like we have here.

@erip just wanted to check if you do plan to follow up on this as well?
Yes, I can follow up on this. It will take a lot more thought since, as you mention, the clean up is quite involved. That said, I think it should be doable.
> Yes, I can follow up on this. It will take a lot more thought since, as you mention, the cleanup is quite involved. That said, I think it should be doable.

Sure, thanks @erip!
Reference #1493