Migrating AmazonReviewPolarity to datapipes #1490
Changes from all commits: c6118e2, 7e50e69, b646bac, 033889b, a4669a4, e56eafb, 97ee0c1, 928f3a9
@@ -1,13 +1,15 @@

```python
from torchtext._internal.module_utils import is_module_available
from typing import Union, Tuple

if is_module_available("torchdata"):
    from torchdata.datapipes.iter import FileOpener, GDriveReader, IterableWrapper
```
**Contributor**

I wonder if we also need to import …

**Contributor (Author)**

That's a good catch @Nayef211. Note that in this case we need `GDriveReader`. Since this is not yet implemented internally, let's fix this part once we import this PR into fb-code.
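For readers unfamiliar with this gating pattern, `is_module_available` can be thought of as a thin wrapper around `importlib.util.find_spec`. A rough sketch of the idea (the actual `torchtext._internal` implementation may differ):

```python
import importlib.util


def is_module_available(name: str) -> bool:
    # True if the module can be located on the import path, without importing it.
    return importlib.util.find_spec(name) is not None


# Anything that lives in torchdata (GDriveReader included) has to stay behind the guard,
# otherwise the import itself fails when torchdata is not installed.
if is_module_available("torchdata"):
    from torchdata.datapipes.iter import FileOpener, GDriveReader, IterableWrapper
```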
The remainder of the import hunk:

```python
from torchtext.data.datasets_utils import (
    _RawTextIterableDataset,
    _wrap_split_argument,
    _add_docstring_header,
    _download_extract_validate,
    _create_dataset_directory,
    _create_data_from_csv,
)

import os
import logging

URL = 'https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbaW12WVVZS2drcnM'
```
@@ -25,20 +27,24 @@ | |
| 'test': f'{os.sep}'.join(['amazon_review_polarity_csv', 'test.csv']), | ||
| } | ||
|
|
||
| _EXTRACTED_FILES_MD5 = { | ||
| 'train': "520937107c39a2d1d1f66cd410e9ed9e", | ||
| 'test': "f4c8bded2ecbde5f996b675db6228f16" | ||
| } | ||
|
|
||
| DATASET_NAME = "AmazonReviewPolarity" | ||
|
|
||
|
|
||
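As context, the MD5 strings above are used to validate the downloaded archive and the extracted CSV files. A minimal, self-contained sketch of how such a checksum can be computed with the standard library (the helper name is made up and is not part of torchtext):

```python
import hashlib


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Hash the file in chunks so large archives never need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# e.g. compare md5_of(<extracted train.csv>) against _EXTRACTED_FILES_MD5['train']
```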
Unchanged decorators, followed by the old implementation that is removed:

```python
@_add_docstring_header(num_lines=NUM_LINES, num_classes=2)
@_create_dataset_directory(dataset_name=DATASET_NAME)
@_wrap_split_argument(('train', 'test'))
def AmazonReviewPolarity(root, split):
    path = _download_extract_validate(root, URL, MD5, os.path.join(root, _PATH),
                                      os.path.join(root, _EXTRACTED_FILES[split]),
                                      _EXTRACTED_FILES_MD5[split], hash_type="md5")
    logging.info('Creating {} data'.format(split))
    return _RawTextIterableDataset(DATASET_NAME, NUM_LINES[split],
                                   _create_data_from_csv(path))
```
| @_wrap_split_argument(("train", "test")) | ||
| def AmazonReviewPolarity(root: str, split: Union[Tuple[str], str]): | ||
| # TODO Remove this after removing conditional dependency | ||
| if not is_module_available("torchdata"): | ||
| raise ModuleNotFoundError("Package `torchdata` not found. Please install following instructions at `https://github.com/pytorch/data`") | ||
|
**Contributor**

Codecov shows missing coverage here. Is there any way we could mock `is_module_available` and test that the exception is thrown?

**Contributor**

I think once there's a stable release of torchdata published, this check will disappear. Is it still worth adding the test since it'll be removed soon anyway?

**Contributor (Author)**

I agree, this is temporary. We can probably skip adding a test.

**Contributor**

Sounds reasonable to me.
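If that coverage were ever wanted, one way is to patch `is_module_available` at the point where the dataset module looks it up. A rough sketch, assuming the module path `torchtext.datasets.amazonreviewpolarity` (the test name and root path are illustrative, not part of this PR):

```python
from unittest import mock

import pytest

from torchtext.datasets.amazonreviewpolarity import AmazonReviewPolarity


def test_amazon_review_polarity_requires_torchdata():
    # Force the availability check to report that torchdata is missing,
    # so the conditional-dependency branch is exercised.
    with mock.patch(
        "torchtext.datasets.amazonreviewpolarity.is_module_available", return_value=False
    ):
        with pytest.raises(ModuleNotFoundError):
            AmazonReviewPolarity(root=".data", split="train")
```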
(continued)

```python
    url_dp = IterableWrapper([URL])
    cache_dp = url_dp.on_disk_cache(
        filepath_fn=lambda x: os.path.join(root, _PATH), hash_dict={os.path.join(root, _PATH): MD5}, hash_type="md5"
    )
    cache_dp = GDriveReader(cache_dp).end_caching(mode="wb", same_filepath_fn=True)
    cache_dp = FileOpener(cache_dp, mode="b")
    extracted_files = cache_dp.read_from_tar()
    filter_extracted_files = extracted_files.filter(lambda x: split in x[0])
```
**Contributor (Author)**

Thanks @Nayef211 for catching this. Yes, I remember this breakage earlier and completely forgot to take care of it. Will do a PR to fix it.
And the final line of the new implementation:

```python
    return filter_extracted_files.parse_csv().map(fn=lambda t: (int(t[0]), ' '.join(t[1:])))
```
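To make the chain above easier to follow: `on_disk_cache` checks the target file (and its MD5) before triggering any download, `GDriveReader` plus `end_caching` perform and persist the download, and the rest of the pipeline opens, untars, filters, and parses the archive lazily. A minimal sketch of the same caching idiom for a plain HTTP URL (the function name, URL handling, and file names are made up for illustration):

```python
import os

from torchdata.datapipes.iter import FileOpener, HttpReader, IterableWrapper


def cached_csv_rows(url: str, root: str, filename: str):
    # Cache the download on disk; skip the network round-trip if the file already exists.
    cache_dp = IterableWrapper([url]).on_disk_cache(
        filepath_fn=lambda _: os.path.join(root, filename)
    )
    cache_dp = HttpReader(cache_dp).end_caching(mode="wb", same_filepath_fn=True)
    # Open the cached file in binary mode and parse it as CSV rows.
    return FileOpener(cache_dp, mode="b").parse_csv()
```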
A n00b question: why did we remove this, and how does this change relate to the migration of AmazonReviewPolarity to datapipes?

**Reply**

The `__str__` method under test here was coming from `_RawTextIterableDataset`, so it's somewhat overcome by events.
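Finally, a quick usage sketch of the migrated dataset (the import path is assumed to be the usual `torchtext.datasets` re-export; root path and printed fields are illustrative, not taken from the PR):

```python
from torchtext.datasets import AmazonReviewPolarity  # assumed public export

# The returned datapipe yields (label, text) tuples produced by parse_csv + map.
# Note: iterating it the first time triggers the (large) Google Drive download.
train_dp = AmazonReviewPolarity(root=".data", split="train")
label, text = next(iter(train_dp))
print(label, text[:80])
```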