This repository was archived by the owner on Sep 10, 2025. It is now read-only.
Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) #624
Merged: zhangguanheng66 merged 51 commits into pytorch:master from zhangguanheng66:legacy_language_modeling on Nov 26, 2019.
Commits
45d53de
Move PennTreebank, WikiText103, WikiText2 to torchtext.legacy
1f95483
Some initial work.
2d3ebe2
Merge branch 'master' into legacy_language_modeling
97af9d0
Re-write three datasets.
544b069
Merge branch 'master' into legacy_language_modeling
cc127de
Update tests.
97cfd05
Move legacy docs for language modeling dataset.
0ac3e18
Update docs.
56046fa
Minor debug
9962732
Update test.
ad7938e
Minor change in tests.
3ff1cce
Flake8
361f688
Merge branch 'master' into legacy_language_modeling
cc1ae4d
Move two funct to data/functional.py.
f4018cc
Fix <'unk'> compability issue.
ff329f9
Minor changes.
65c470c
Update unit tests.
96cd268
Merge branch 'master' into legacy_language_modeling
25336b9
Minor change
4819f18
Add flags for train/valid/test/
48cb0a8
Update docs.
7d70298
Add returned_dataset flag to determin subset data.
0588f1d
A small bug.
f01037d
Remove some printout.
f2ea3f1
Remove unk token.
a32712d
Use data_select.
d217294
Support a string in data_select.
cb902d4
Use torch.tensor instead of torch.Tensor
3a05197
remove duplicate code.
ac99329
Minor change in doc.
3a342c0
Change the extracted_files.
149cbc4
Docs.
6cfe9c9
get_data_path
297d1cc
Remove <unk> token.
d548bf6
Replace _data with data.
e77758e
Change create_data_from_iterator to double iter.
6d49f40
Add select_to_index.
1f60293
check subset.
8bb1cb2
Error if dataset is empty.
6a50f2a
filter output is iterable.
a29f4bd
flake8
9206e63
Add a claimer in README.rst
e2ba8bf
revise create_data_from_iterator
0993540
Remove a printout.
81055a0
Remove version num in legacy.
9dc4752
remove read_text_iterator func
367a340
Update README.
b54b883
Update the test case after not using read_text_iterator
1478d13
rename to numericalize_tokens_from_iterator
cf7c188
flake8
03dfc27
minor
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| torchtext.legacy.datasets | ||
| ========================= | ||
|
|
||
| .. currentmodule:: torchtext.legacy.datasets | ||
|
|
||
| TorchText legacy datasets. | ||
|
|
||
| All datasets are subclasses of :class:`torchtext.data.Dataset`, which | ||
| inherits from :class:`torch.utils.data.Dataset`; each dataset provides | ||
| ``splits`` and ``iters`` class methods. | ||
|
|
||
| General use cases are as follows: | ||
|
|
||
| Approach 1, ``splits``: :: | ||
|
|
||
| # set up fields | ||
| TEXT = data.Field(lower=True, include_lengths=True, batch_first=True) | ||
| LABEL = data.Field(sequential=False) | ||
|
|
||
| # make splits for data | ||
| train, test = datasets.IMDB.splits(TEXT, LABEL) | ||
|
|
||
| # build the vocabulary | ||
| TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300)) | ||
| LABEL.build_vocab(train) | ||
|
|
||
| # make iterator for splits | ||
| train_iter, test_iter = data.BucketIterator.splits( | ||
| (train, test), batch_size=3, device=0) | ||
|
|
||
| Approach 2, ``iters``: :: | ||
|
|
||
| # use default configurations | ||
| train_iter, test_iter = datasets.IMDB.iters(batch_size=4) | ||
|
|
||
| The following datasets are available: | ||
|
|
||
| .. contents:: Datasets | ||
| :local: | ||
|
|
||
|
|
||
| Language Modeling | ||
| ^^^^^^^^^^^^^^^^^ | ||
|
|
||
| Language modeling datasets are subclasses of the ``LanguageModelingDataset`` class. | ||
|
|
||
| .. autoclass:: LanguageModelingDataset | ||
| :members: __init__ | ||
|
|
||
|
|
||
| WikiText-2 | ||
| ~~~~~~~~~~ | ||
|
|
||
| .. autoclass:: WikiText2 | ||
| :members: splits, iters | ||
|
|
||
|
|
||
| WikiText-103 | ||
| ~~~~~~~~~~~~ | ||
|
|
||
| .. autoclass:: WikiText103 | ||
| :members: splits, iters | ||
|
|
||
|
|
||
| PennTreebank | ||
| ~~~~~~~~~~~~ | ||
|
|
||
| .. autoclass:: PennTreebank | ||
| :members: splits, iters |
One optimization here could be to yield an iterator instead of a list. This way we don't have to materialize the numbers per sentence which could be pretty large (and lists can be very slow).
Sure. That's doable. Then, we materialize the token id outside the function.
Done!
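The suggestion discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name echoes the commit "rename to numericalize_tokens_from_iterator", while the plain `dict` vocabulary, the `unk_id` fallback parameter, and the toy sentences are assumptions made for the example. Each tokenized line is yielded as a lazy iterator of token ids, and the caller materializes the ids outside the function.

```python
from typing import Dict, Iterable, Iterator


def numericalize_tokens_from_iterator(
    vocab: Dict[str, int],
    token_lines: Iterable[Iterable[str]],
    unk_id: int = 0,  # assumed fallback id for out-of-vocabulary tokens
) -> Iterator[Iterator[int]]:
    """Yield one lazy iterator of token ids per tokenized line.

    No id is looked up until the caller consumes the inner iterator,
    so no per-line list of ids is built inside this function.
    """
    for tokens in token_lines:
        # Generator expression: ids are produced on demand.
        yield (vocab.get(token, unk_id) for token in tokens)


# Hypothetical toy vocabulary and corpus, for illustration only.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
lines = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# The caller decides when (and whether) to materialize the ids.
flat_ids = [
    token_id
    for id_iter in numericalize_tokens_from_iterator(vocab, lines)
    for token_id in id_iter
]
print(flat_ids)
```

With this shape, the flattened ids could then be handed to something like `torch.tensor(flat_ids)` by the caller, which keeps the tensor construction out of the numericalization helper.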