Skip to content
This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@erip
Copy link
Contributor

@erip erip commented Jan 31, 2022

Closes #1494

cc @parmeet I've had some success with this locally...

@erip erip changed the title WIP: add CC100 [WIP] add CC100 Jan 31, 2022
@erip
Copy link
Contributor Author

erip commented Jan 31, 2022

Calling this a WIP is maybe a bit conservative. It works locally, but it seems like testing it requires some additional thought because this isn't a traditional supervised dataset (so it doesn't really have a concept of splits) and thus dies here. I welcome discussion here and will leave the WIP until we've settled on something.

@abhinavarora
Copy link
Contributor

Looks like the server hosting the dataset is back up :-)

@erip
Copy link
Contributor Author

erip commented Feb 1, 2022

it doesn't really have a concept of splits

Neither does EnWiki9, so I followed that approach. 😄

@erip
Copy link
Contributor Author

erip commented Feb 1, 2022

It seems like *args don't get propagated appropriately in the split-wrapper decorator. I can give the language_code a default value (or make it a Tuple[str] to handle multiple values).

Edit: maybe for better composability, it's better to make language_code a str, so if users want multiple langs they can just instantiate multiple CC100s and compose them using native datapipe constructs.

@erip erip changed the title [WIP] add CC100 add CC100 Feb 1, 2022
@parmeet
Copy link
Contributor

parmeet commented Feb 3, 2022

Thanks @erip for taking up this one. I also have a local commit and was actually waiting for resolving the server issue first before check-in. Let's work through your PR :)

@erip
Copy link
Contributor Author

erip commented Feb 4, 2022

I've incorporated your feedback, @parmeet. It seems like there's some interference with _wrap_split_argument. I'm trying to mimic the behavior in EnWiki9 but when I give CC100 no default args, I encounter this error:

>>> from torchtext.datasets import CC100
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/erip/Code/text/torchtext/__init__.py", line 7, in <module>
    from . import datasets
  File "/Users/erip/Code/text/torchtext/datasets/__init__.py", line 6, in <module>
    from .cc100 import CC100
  File "/Users/erip/Code/text/torchtext/datasets/cc100.py", line 36, in <module>
    def CC100(root: str, split: Union[Tuple[str], str], language_code: str):
  File "/Users/erip/Code/text/torchtext/data/datasets_utils.py", line 309, in new_fn
    return _wrap_split_argument_with_fn(fn, splits)
  File "/Users/erip/Code/text/torchtext/data/datasets_utils.py", line 301, in _wrap_split_argument_with_fn
    new_sig = new_sig.replace(parameters=tuple(new_params))
  File "/Users/erip/opt/miniconda3/envs/torchtext-dev/lib/python3.8/inspect.py", line 2877, in replace
    return type(self)(parameters,
  File "/Users/erip/opt/miniconda3/envs/torchtext-dev/lib/python3.8/inspect.py", line 2810, in __init__
    raise ValueError(msg)
ValueError: non-default argument follows default argument

and when I give default values, the returned value is always a tuple (even though there's one split). Does this seem familiar?

@erip
Copy link
Contributor Author

erip commented Feb 4, 2022

Ah, I figured it out... Either everything needs a default value or only arguments after split can have a default value. This seems good now and should be consistent with other datasets. 👍

@erip erip force-pushed the feature/add-CC100-datapipes branch from 601ade1 to 0775e78 Compare February 4, 2022 13:30
Copy link
Contributor

@parmeet parmeet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @erip for contributing CC-100, this is the first large-scale dataset added to the repo :). Overall LGTM! As a follow-up we potentially want to figure out what to do with MD5 checks. I guess, It is not a trivial exercise to calculate MD5 for all 100+ huge files :(

@parmeet parmeet merged commit aa78b86 into pytorch:main Feb 4, 2022
@erip erip deleted the feature/add-CC100-datapipes branch February 4, 2022 17:52
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate datasets to build on top of torchdata datapipes

4 participants