Skip to content
This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@Nayef211
Copy link
Contributor

Summary:

  • Removed copying partial SST2 asset file to a temp dir and instead directly working with the file from the asset folder
  • Fixed bug with path names affecting how files were filtered out from the zip file
    • For example, if the value of split is "test", the following snippet of code filter(lambda x: split in x[0]) might match all of the "train", "test", and "dev" files depending on the location of the dataset asset file
    • When testing with buck, the location of the extracted files could look something like /data/users/nayef211/fbsource/fbcode/buck-out/dev/gen/pytorch/text/test/experimental_test_datasets#binary,link-tree/test/asset/SST2/SST-2.zip/train.tsv. Since the word "test" is contained in this path string, the filtering logic would incorrectly select the "train" file even though what we want is the "test" file
    • To resolve this we append the file extension (in this case ".tsv") to the split variable in the filtering logic

Reviewed By: parmeet

Differential Revision: D32329831

fbshipit-source-id: dbb4803a04f6cd50fab3f7ce5530d3258b2db012

Summary:
- Removed copying partial SST2 asset file to a temp dir and instead directly working with the file from the asset folder
- Fixed bug with path names affecting how files were filtered out from the zip file
   - For example, if the value of `split` is "test", the following snippet of code `filter(lambda x: split in x[0])` might match all of the "train", "test", and "dev" files depending on the location of the dataset asset file
   - When testing with buck, the location of the extracted files could look something like `/data/users/nayef211/fbsource/fbcode/buck-out/dev/gen/pytorch/text/test/experimental_test_datasets#binary,link-tree/test/asset/SST2/SST-2.zip/train.tsv`. Since the word "test" is contained in this path string, the filtering logic would incorrectly select the "train" file even though what we want is the "test" file
   - To resolve this we append the file extension (in this case ".tsv") to the `split` variable in the filtering logic

Reviewed By: parmeet

Differential Revision: D32329831

fbshipit-source-id: dbb4803a04f6cd50fab3f7ce5530d3258b2db012
@Nayef211 Nayef211 marked this pull request as ready for review November 15, 2021 20:01
Copy link
Contributor

@parmeet parmeet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just put [fbsync] in front of the title, so that if we plan to import on per-commit basis, we know which commits to skip :). Thanks!

@Nayef211 Nayef211 changed the title Fixed file filtering bug in SST2 dataset [fbsync] Fixed file filtering bug in SST2 dataset Nov 15, 2021
@Nayef211 Nayef211 merged commit e504b81 into main Nov 15, 2021
@Nayef211 Nayef211 deleted the fbsync_main branch August 18, 2022 14:58
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants