add CC100 #1562
|
Calling this a WIP is maybe a bit conservative. It works locally, but it seems like testing it requires some additional thought because this isn't a traditional supervised dataset (so it doesn't really have a concept of splits) and thus dies here. I welcome discussion here and will leave the WIP until we've settled on something. |
|
Looks like the server hosting the dataset is back up :-) |
Neither does EnWiki9, so I followed that approach. 😄 |
|
It seems like Edit: maybe for better composability, it's better to make |
|
Thanks @erip for taking this one up. I also have a local commit and was waiting for the server issue to be resolved before checking it in. Let's work through your PR :) |
|
I've incorporated your feedback, @parmeet. It seems like there's some interference with

>>> from torchtext.datasets import CC100
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/erip/Code/text/torchtext/__init__.py", line 7, in <module>
    from . import datasets
  File "/Users/erip/Code/text/torchtext/datasets/__init__.py", line 6, in <module>
    from .cc100 import CC100
  File "/Users/erip/Code/text/torchtext/datasets/cc100.py", line 36, in <module>
    def CC100(root: str, split: Union[Tuple[str], str], language_code: str):
  File "/Users/erip/Code/text/torchtext/data/datasets_utils.py", line 309, in new_fn
    return _wrap_split_argument_with_fn(fn, splits)
  File "/Users/erip/Code/text/torchtext/data/datasets_utils.py", line 301, in _wrap_split_argument_with_fn
    new_sig = new_sig.replace(parameters=tuple(new_params))
  File "/Users/erip/opt/miniconda3/envs/torchtext-dev/lib/python3.8/inspect.py", line 2877, in replace
    return type(self)(parameters,
  File "/Users/erip/opt/miniconda3/envs/torchtext-dev/lib/python3.8/inspect.py", line 2810, in __init__
    raise ValueError(msg)
ValueError: non-default argument follows default argument

and when I give default values, the returned value is always a tuple (even though there's one split). Does this seem familiar? |
|
Ah, I figured it out... Either everything needs a default value or only arguments after |
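For anyone hitting the same `ValueError` as in the traceback above, here is a minimal sketch that reproduces it with the standard library alone. It does not use torchtext: the function `load_corpus` and the default values `".data"` and `"en"` are hypothetical stand-ins, and the only assumption is that the wrapper rebuilds the signature via `inspect.Signature.replace`, as the traceback shows.

```python
import inspect


def load_corpus(root: str, language_code: str):
    """Hypothetical stand-in for a dataset function; not torchtext code."""
    return root, language_code


sig = inspect.signature(load_corpus)
params = list(sig.parameters.values())

# Give the first positional parameter a default but leave the next one
# without a default. Rebuilding the signature then fails validation.
mixed_params = [params[0].replace(default=".data"), params[1]]
try:
    sig.replace(parameters=mixed_params)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Giving every positional parameter a default passes validation.
all_default_params = [
    params[0].replace(default=".data"),
    params[1].replace(default="en"),
]
good_sig = sig.replace(parameters=all_default_params)

print(error_message)
print(good_sig)
```

The check lives in `inspect.Signature` itself, so it fires when the decorator builds the new signature at import time, before the dataset function is ever called.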
601ade1 to
0775e78
parmeet
left a comment
Thanks @erip for contributing CC-100, this is the first large-scale dataset added to the repo :). Overall LGTM! As a follow-up, we potentially want to figure out what to do with MD5 checks. I guess it is not a trivial exercise to calculate MD5 for all 100+ huge files :(
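On the MD5 follow-up, one possible sketch (not what the PR does, and not a torchtext API) is to hash each file in fixed-size chunks so that memory use stays flat regardless of file size. The helper name `md5_of_stream` and the chunk size are assumptions made for illustration; an in-memory stream stands in for a real downloaded file.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a binary stream, reading chunk_size bytes at a time.

    Hypothetical helper for illustration only.
    """
    digest = hashlib.md5()
    # iter() with a sentinel keeps reading until read() returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Demo on an in-memory stream; a tiny chunk size forces several reads.
demo_digest = md5_of_stream(io.BytesIO(b"hello world"), chunk_size=4)
print(demo_digest)
```

For a real file the same function would be called on `open(path, "rb")`. The cost of reading every byte of 100+ large files remains, so this only addresses memory, not the total time.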
Closes #1494
cc @parmeet I've had some success with this locally...