experimental.dataset WikiText2, WikiText103, PennTreeBank, WMTNewsCrawl #774
Conversation
…language_modeling
# Conflicts:
#	torchtext/experimental/datasets/language_modeling.py
    self.data = data
    self.vocab = vocab
    self.transforms = transforms
    self.data = torch.cat(tuple(transforms(row) for row in data), axis=0)
To be consistent with the text classification datasets, please call the transform func in the __getitem__ func.
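For reference, a minimal sketch of that shape, assuming the dataset keeps raw rows around (the names mirror the hunk above; this is illustrative, not the PR's actual code):

```python
from torch.utils.data import Dataset

class LazyLanguageModelingDataset(Dataset):
    """Sketch: store raw rows and apply transforms on access,
    the way the text classification datasets do."""

    def __init__(self, data, vocab, transforms):
        self.data = data            # raw, untransformed rows
        self.vocab = vocab
        self.transforms = transforms

    def __getitem__(self, i):
        # Transform applied lazily, per item.
        return self.transforms(self.data[i])

    def __len__(self):
        return len(self.data)
```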
It might be hard to call the transform func in __getitem__. For a dataset like WikiText2, the raw data wiki.train.tokens is stored as a multi-line txt file, and self.__getitem__(i) is expected to output the i-th token. Therefore, pre-processing is required to concat and tokenize the multi-line file. Can you advise whether we need an additional preprocess func?
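For illustration, the kind of preprocess step meant here (a hypothetical helper; the vocab lookup interface is an assumption): flatten the multi-line file into one tensor of token ids up front, so __getitem__(i) can index the i-th token directly.

```python
import torch

def preprocess(raw_lines, tokenizer, vocab):
    # Tokenize each line, map tokens to ids (vocab[token] is an
    # assumed lookup interface), and concatenate everything into
    # a single 1-D tensor of token ids.
    line_ids = [torch.tensor([vocab[token] for token in tokenizer(line)],
                             dtype=torch.long)
                for line in raw_lines]
    return torch.cat(line_ids)
```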
I can't move transforms into __getitem__, because the tokenizer must be applied before __getitem__. Otherwise, the unit test https://github.com/pytorch/text/blob/master/test/data/test_builtin_datasets.py#L54-L57 will fail.
Possible solutions:
- Move `tokenizer` out of `transforms`
- OR: split `transforms` into two parts: global transforms / token-level transforms (see the sketch after this list).

Please advise how I can move forward. Thanks.
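A sketch of the second option, with hypothetical names global_transforms / token_transforms:

```python
import torch

def build_data(raw_lines, global_transforms, token_transforms):
    # Global transforms must run before __getitem__ (e.g. the
    # tokenizer, which needs the whole multi-line file); the
    # token-level transforms (e.g. vocab lookup) are cheap per
    # token and could instead move into __getitem__.
    tokens = global_transforms(raw_lines)
    return torch.tensor([token_transforms(t) for t in tokens],
                        dtype=torch.long)
```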
Looks good to me and very close to the dataset abstraction. I only made a few suggestions for changes.
Could you also consolidate WMTNewsCrawl here into the word language modeling datasets?
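A sketch of what that consolidation could look like, assuming the module's existing _setup_datasets helper (the wrapper name and wiring below are assumptions, mirroring how the other datasets in language_modeling.py are defined):

```python
def WMTNewsCrawl(*args, **kwargs):
    # Route WMTNewsCrawl through the same shared setup path as
    # WikiText2/WikiText103/PennTreebank, rather than keeping a
    # separate implementation. _setup_datasets is assumed to be
    # the module-level helper used by the other wrappers.
    return _setup_datasets(*(("WMTNewsCrawl",) + args), **kwargs)
```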
    Arguments:
        root: Directory where the datasets are saved. Default: ".data"
        vocab: Vocabulary used for dataset. If None, it will generate a new
Since you changed the order of vocab and tokenizer, will this be BC-breaking?
Yes, it's potentially BC-breaking if people use positional args instead of kwargs. I restored the previous order of vocab and tokenizer in commit f433b40 above.
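A tiny self-contained illustration of the hazard (schematic signatures, not the actual dataset API):

```python
def old_api(tokenizer=None, vocab=None):
    return tokenizer, vocab

def new_api(vocab=None, tokenizer=None):
    return tokenizer, vocab

tok = str.split  # stand-in tokenizer

# Positional callers silently get the wrong binding after reordering:
assert old_api(tok) == (tok, None)
assert new_api(tok) == (None, tok)   # tok now lands in vocab

# Keyword callers are unaffected:
assert new_api(tokenizer=tok) == (tok, None)
```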
Great. Let me know when you are done with the revision. I will have another look.
I thought we agreed to go with the order tokenizer, root, vocab.
@z-Runmin Please let us know if you need a review.
Not sure why unittest_windows failed to install.
@peterjc123 Could you take a look here? The master branch is green.
    >>> vocab = build_vocab_from_iterator([['language', 'modeling']])
    >>> dataset = LanguageModelingDataset(data, vocab)
    transforms: Text string transforms.
And add docs for single_line.
LGTM. I added only two comments. Once the CI tests are fixed, we can merge the PR. Then I will switch my BERT pipeline to the new datasets.
And could you add a test for the WMTNewsCrawl dataset? Like
text/test/data/test_builtin_datasets.py, line 79 in e709553:

    def test_penntreebank(self):
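A sketch of such a test, modeled on test_penntreebank (the data_select kwarg and the assertions are assumptions; the real test would check exact token counts and vocab sizes):

```python
def test_wmtnewscrawl(self):
    from torchtext.experimental.datasets import WMTNewsCrawl
    # Shape mirrors test_penntreebank above; the concrete
    # expected values would come from the actual dataset.
    train_dataset, = WMTNewsCrawl(data_select=('train',))
    self.assertGreater(len(train_dataset), 0)
    self.assertGreater(len(train_dataset.vocab), 0)
```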
@z-Runmin @zhangguanheng66 You might need to rebase your branch on master.
Looks like the WMT dataset is so large that it couldn't be downloaded within the time limit (30 min).
OK, let me remove the test.
Added some minor changes. Will merge after the CI tests pass.