This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@erip
Contributor

@erip erip commented Feb 1, 2022

Reference #1493

Comment on lines +68 to +74
expected_samples = _get_mock_dataset(self.root_dir, split, src, tgt)

dataset = IWSLT2016(root=self.root_dir, split=split)

samples = list(dataset)

for sample, expected_sample in zip_equal(samples, expected_samples):
Contributor

NIT: can we organize samples and expected_samples similarly to what we do in the SST2 PR, for consistency?

Contributor Author

Unfortunately it's not quite as straightforward unless we want to hardcode the language pairs in setUpClass. Otherwise there's no good way to parameterize them. This is required because the expected file name for caching is a function of the (src_lang, tgt_lang, split) which is somewhat unique.

Contributor Author

@erip erip Feb 2, 2022

One thing we could do is make samples a function which accepts *args and each dataset could handle them the same way using this pattern... The order matters here though because _get_mock_dataset needs to create the temp dir before IWSLT2016 reads it.
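A minimal sketch of the pattern proposed above, with hypothetical stand-in names (`get_expected_samples` and `run_split_test` are illustrative, not the real torchtext test helpers):

```python
# Hypothetical sketch: "samples" becomes a function accepting *args so each
# dataset test can forward its own parameters. The real _get_mock_dataset
# writes temp files as a side effect, so it must run BEFORE the dataset is
# constructed and reads them -- that is the ordering constraint mentioned above.
def get_expected_samples(*args):
    root_dir, split, src, tgt = args
    # Stand-in for _get_mock_dataset: return deterministic (src, tgt) pairs.
    return [(f"{src}-{i}", f"{tgt}-{i}") for i in range(5)]

def run_split_test(*args):
    expected = get_expected_samples(*args)  # 1) create the mocked files
    samples = list(expected)                # 2) stand-in for list(IWSLT2016(...))
    return samples == expected
```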

Contributor

Unfortunately it's not quite as straightforward unless we want to hardcode the language pairs in setUpClass. Otherwise there's no good way to parameterize them. This is required because the expected file name for caching is a function of the (src_lang, tgt_lang, split) which is somewhat unique.

Gotcha, I think it's okay to keep your current implementation. I didn't catch the fact that the ordering mattered here.


@parameterized.expand([("train", "de", "en"), ("valid", "de", "en")])
def test_iwslt2016(self, split, src, tgt):
    expected_samples = _get_mock_dataset(self.root_dir, split, src, tgt)
Contributor

Is there any specific reason why we don't want to generate all the mocked data within the setUpClass method and store it in self.samples like we do in the other tests?

Contributor

@Nayef211 Nayef211 left a comment

Overall LGTM. I think we can merge this once we fix the nit comment I left about variable renaming!


@erip
Contributor Author

erip commented Feb 2, 2022

Done, @Nayef211!

@Nayef211 Nayef211 merged commit 3ba62ca into pytorch:main Feb 2, 2022
@erip erip deleted the feature/mocked-iwslt16-test branch February 2, 2022 20:43
cls.patcher.stop()
super().tearDownClass()

@parameterized.expand([("train", "de", "en"), ("valid", "de", "en")])
Contributor

IWSLT2016 also consists of a test split, so ideally we should include it in testing as well.

Contributor Author

@erip erip Feb 3, 2022

Oops, yes. I had made a change and forgot to re-incorporate the test split. I can cut a PR to fix in the morning.
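A hedged sketch of the fix being described: extending the parameterization to cover the missing test split, mirroring the (split, src, tgt) tuples already used in the test.

```python
# Illustrative only -- in the real test this list would feed
# @parameterized.expand(SPLITS) on test_iwslt2016.
SPLITS = [
    ("train", "de", "en"),
    ("valid", "de", "en"),
    ("test", "de", "en"),  # the split that was accidentally dropped
]
```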

Comment on lines +15 to +45
def _get_mock_dataset(root_dir, split, src, tgt):
    """
    root_dir: directory to the mocked dataset
    """
    temp_dataset_dir = os.path.join(root_dir, f"IWSLT2016/2016-01/texts/{src}/{tgt}/{src}-{tgt}/")
    os.makedirs(temp_dataset_dir, exist_ok=True)

    seed = 1
    mocked_data = defaultdict(lambda: defaultdict(list))
    valid_set = "tst2013"
    test_set = "tst2014"

    files_for_split, _ = _generate_iwslt_files_for_lang_and_split(16, src, tgt, valid_set, test_set)
    src_file = files_for_split[src][split]
    tgt_file = files_for_split[tgt][split]
    for file_name in (src_file, tgt_file):
        txt_file = os.path.join(temp_dataset_dir, file_name)
        with open(txt_file, "w") as f:
            # Get the file extension (i.e. the language) without the "." prefix (".en" -> "en")
            lang = os.path.splitext(file_name)[1][1:]
            for _ in range(5):
                rand_string = " ".join(
                    random.choice(string.ascii_letters) for _ in range(seed)
                )
                dataset_line = f"{rand_string} {rand_string}\n"
                # Append the line to the correct dataset split
                mocked_data[split][lang].append(dataset_line)
                f.write(dataset_line)
                seed += 1

    return list(zip(mocked_data[split][src], mocked_data[split][tgt]))
Contributor

@parmeet parmeet Feb 3, 2022

Is there a reason we are not creating a download archive 2016-01.tgz like we are doing for other datasets?

Edit: I think it's quite important to start from the download archive; otherwise we can run into hard-to-find bugs, especially when the compression pattern is as complex as it is here.
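A hedged sketch of what "starting from the download archive" could look like: write the raw text files, then pack them into a 2016-01.tgz so the test exercises the dataset's real extraction path. The file names and inner layout here are illustrative, not the exact IWSLT directory structure.

```python
import os
import tarfile
import tempfile

def make_mock_archive(root_dir):
    # Lay out a minimal inner directory (illustrative path, not the full
    # IWSLT2016 texts/{src}/{tgt}/{src}-{tgt} hierarchy).
    inner_dir = os.path.join(root_dir, "2016-01", "texts")
    os.makedirs(inner_dir, exist_ok=True)
    with open(os.path.join(inner_dir, "train.de-en.de"), "w") as f:
        f.write("mocked source line\n")

    # Pack everything into 2016-01.tgz; arcname keeps the archive rooted
    # at 2016-01/ like the real download.
    archive_path = os.path.join(root_dir, "2016-01.tgz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(os.path.join(root_dir, "2016-01"), arcname="2016-01")
    return archive_path
```

The dataset under test would then be pointed at root_dir and left to extract the archive itself, so any mismatch in the expected compression layout surfaces in the test rather than in production.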

Contributor

Is there a reason we are not creating a download archive 2016-01.tgz like we are doing for other datasets?

Edit: I think it's quite important to start from the download archive; otherwise we can run into hard-to-find bugs, especially when the compression pattern is as complex as it is here.

@erip just wanted to check if you do plan to follow-up on this as well?

Contributor Author

Yes, I can follow up on this. It will take a lot more thought since, as you mention, the cleanup is quite involved. That said, I think it should be doable.

Contributor

Yes, I can follow up on this. It will take a lot more thought since, as you mention, the cleanup is quite involved. That said, I think it should be doable.

Sure, thanks @erip!

@parmeet parmeet mentioned this pull request Feb 8, 2022
