Steps to retire legacy code and release new building blocks in torchtext

A new abstraction was described in the 0.5.0 [release note](https://github.com/pytorch/text/releases/tag/0.5.0). We are now working on retiring some legacy code in torchtext over the next releases. This issue tracks the progress of that work. Here are the steps users can expect:

### Step 1: Retire legacy code in `torchtext.data` and `torchtext.datasets`
The following components will soon be retired from the source code. We added deprecation warnings in the 0.7.0 release ([link](https://github.com/pytorch/text/releases/tag/v0.7.0-rc3)). Users will still be able to find them in `torchtext.legacy`, while the original constructors will raise an error when called.
- `torchtext.data.field` - RawField, Field, ReversibleField, SubwordField, NestedField, LabelField
- `torchtext.data.iterator` - BucketIterator, Iterator, BPTTIterator
- `torchtext.data.dataset` - Dataset, TabularDataset
- `torchtext.data.example` - Example
- `torchtext.data.pipeline` - Pipeline
- `torchtext.data.batch` - Batch
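
For users who want to check whether their code still touches the components above, one option is to record the warnings raised during a call. This is a minimal sketch using only the standard library; `build_legacy_field` and its message are hypothetical stand-ins for a deprecated constructor, not torchtext code, and the warning category torchtext actually emits is an assumption here.

```python
import warnings


def call_and_collect_warnings(fn, *args, **kwargs):
    """Call `fn` and return (result, list of warning messages it raised)."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" makes sure repeated warnings are not silently deduplicated.
        warnings.simplefilter("always")
        result = fn(*args, **kwargs)
    return result, [str(w.message) for w in caught]


def build_legacy_field():
    # Hypothetical stand-in for a deprecated constructor.
    warnings.warn("this class will be retired; see the legacy folder", UserWarning)
    return "field"


result, messages = call_and_collect_warnings(build_legacy_field)
for message in messages:
    print("deprecation candidate:", message)
```

In a test suite, the same idea can be made stricter by calling `warnings.simplefilter("error")` so that any warning fails the test.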

At the same time, the datasets in `torchtext.datasets` are built on the legacy code above, so they will be moved to the legacy folder:
- language_modeling - `LanguageModelingDataset`, `WikiText2`, `WikiText103`, `PennTreebank`
- nli - `SNLI`, `MultiNLI`, `XNLI`
- sst - `SST`
- translation - `TranslationDataset`, `Multi30k`, `IWSLT`, `WMT14`
- sequence_tagging - `SequenceTaggingDataset`, `UDPOS`, `CoNLL2000Chunking`
- trec - `TREC`
- imdb - `IMDB`
- babi - `BABI20`
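
Code that must run both before and after the move can resolve a class from whichever location provides it. The helper below is a rough sketch: `import_first` and `load_field` are hypothetical names, and the module path `torchtext.legacy.data` is an assumption that the legacy folder mirrors the original layout. The legacy location is tried first because, once retired, the original constructor is expected to still exist but raise an error.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that has it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Module (or one of its parents) is not available; try the next one.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of: {', '.join(module_paths)}")


def load_field():
    # Assumed locations; adjust to the torchtext version actually installed.
    return import_first("Field", ["torchtext.legacy.data", "torchtext.data"])
```

Calling `load_field()` raises `ImportError` if neither location is importable, which keeps the failure explicit rather than deferring it to the first use of the class.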

### Step 2: Release the new datasets
Several of the legacy datasets above have been rewritten and are currently available in `torchtext.experimental.datasets`. They will be released to the core library:
- language_modeling - `LanguageModelingDataset`, `WikiText2`, `WikiText103`, `PennTreebank`, `WMTNewsCrawl`
- text_classification - `AG_NEWS`, `SogouNews`, `DBpedia`, `YelpReviewPolarity`, `YelpReviewFull`, `YahooAnswers`, `AmazonReviewPolarity`, `AmazonReviewFull`, `IMDB`
- sequence_tagging - `UDPOS`, `CoNLL2000Chunking`
- translation - `Multi30k`, `IWSLT`, `WMT14`
- question_answer - `SQuAD1`, `SQuAD2`

### Step 3: Retire legacy vocab/vector and release the new data processing building blocks
We have also rewritten the vocabulary and word vectors as high-performance building blocks with JIT support. We will retire the following components:
- `torchtext.vocab.Vocab`
- `torchtext.vocab.Vectors` along with `GloVe`, `FastText`, `CharNGram`.

After this, the new vocabulary and vector building blocks in the `experimental` folder will be moved to the core library.
- `torchtext.experimental.vectors`
- `torchtext.experimental.vocab`

We also have some transforms that will be released to the core library.
- `torchtext.experimental.transforms`

We understand this is an unusual period for the torchtext library, since we have to maintain the legacy code and the new building blocks at the same time. We really appreciate the efforts of the OSS community. Users should use the code in the three categories with the following expectations:
- `legacy` folder - we will accept bug fixes but not new features.
- `torchtext` main folder - officially supported via the stable release, with BC-breaking changes handled carefully.
- `experimental` folder - experimental components available via the nightly release channel. Users might experience BC-breaking changes without warning messages.
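
To see which of these three categories a project relies on, its imports can be audited statically. The sketch below uses only the standard library `ast` module; `support_tier` and `audit_imports` are hypothetical helper names, and the tier labels simply follow the folder names listed above.

```python
import ast


def support_tier(module_path):
    """Map a dotted import path to 'legacy', 'experimental', 'stable', or None."""
    parts = module_path.split(".")
    if parts[0] != "torchtext":
        return None  # not a torchtext import
    if len(parts) > 1 and parts[1] in ("legacy", "experimental"):
        return parts[1]
    return "stable"


def audit_imports(source):
    """Return sorted (import path, tier) pairs for every torchtext import in `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            paths = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Include the imported name so `from torchtext import legacy` is caught.
            paths = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        for path in paths:
            tier = support_tier(path)
            if tier is not None:
                found.add((path, tier))
    return sorted(found)


example = "from torchtext.legacy.data import Field\nimport torchtext.experimental.vocab\n"
for path, tier in audit_imports(example):
    print(f"{tier:12s} {path}")
```

Imports reported as `experimental` are the ones most likely to break between nightly builds, so they are a natural place to pin versions or add tests.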
