This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Steps to retire legacy code and release new building blocks in torchtext #985

@zhangguanheng66

A new abstraction was described in the 0.5.0 release notes. We are now working on retiring legacy code in torchtext over the next few releases, and this issue tracks the progress of that work. Users can expect the following steps:

Step 1: Retire legacy code in torchtext.data and torchtext.datasets

The following components will soon be retired from the source code. We added deprecation warnings for them in the 0.7.0 release (link). They will remain available in torchtext.legacy, and calling the original constructors will raise an error.

  • torchtext.data.field - RawField, Field, ReversibleField, SubwordField, NestedField, LabelField
  • torchtext.data.iterator - BucketIterator, Iterator, BPTTIterator
  • torchtext.data.dataset - Dataset, TabularDataset
  • torchtext.data.example - Example
  • torchtext.data.pipeline - Pipeline
  • torchtext.data.batch - Batch
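
For a codebase that still imports these names from torchtext.data, a small scan can list the lines that need attention. The following is a minimal sketch and not part of torchtext: the helper name find_legacy_imports is hypothetical, and the target path torchtext.legacy.data is an assumption that the legacy folder mirrors the original package layout (this issue only names torchtext.legacy). It handles single-line `from torchtext.data import ...` statements only.

```python
import re

# Class names listed above as retired from torchtext.data.
LEGACY_DATA_NAMES = {
    "RawField", "Field", "ReversibleField", "SubwordField", "NestedField",
    "LabelField", "BucketIterator", "Iterator", "BPTTIterator",
    "Dataset", "TabularDataset", "Example", "Pipeline", "Batch",
}

# Assumption: the legacy copies live under torchtext.legacy.data.
ASSUMED_LEGACY_MODULE = "torchtext.legacy.data"

# Matches single-line imports such as "from torchtext.data import Field, Batch".
_IMPORT_RE = re.compile(r"^\s*from\s+torchtext\.data\s+import\s+(.+)$")


def find_legacy_imports(source):
    """Return (line_number, name, suggested_module) for each legacy import."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = _IMPORT_RE.match(line)
        if match is None:
            continue
        for part in match.group(1).split(","):
            # Drop parentheses and any "as alias" suffix.
            words = part.strip("() \t").split()
            name = words[0] if words else ""
            if name in LEGACY_DATA_NAMES:
                hits.append((lineno, name, ASSUMED_LEGACY_MODULE))
    return hits
```

Passing the text of a source file to find_legacy_imports gives one tuple per legacy name, which can then be reviewed by hand before any import is changed.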

The datasets in torchtext.datasets are built on the legacy code above, so they will be moved to the legacy folder at the same time:

  • language_modeling - LanguageModelingDataset, WikiText2, WikiText103, PennTreebank
  • nli - SNLI, MultiNLI, XNLI
  • sst - SST
  • translation - TranslationDataset, Multi30k, IWSLT, WMT14
  • sequence_tagging - SequenceTaggingDataset, UDPOS, CoNLL2000Chunking
  • trec - TREC
  • imdb - IMDB
  • babi - BABI20
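
Code that has to run against releases from both before and after the move can try the legacy location first and fall back to the original one. This is a sketch under assumptions: import_first is a hypothetical helper, and the module path torchtext.legacy.datasets shown in the comment assumes that the legacy folder mirrors the original layout.

```python
import importlib


def import_first(*module_names):
    """Import and return the first module in module_names that is available."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:  # also covers ModuleNotFoundError
            continue
    raise ImportError("none of these modules could be imported: %s"
                      % ", ".join(module_names))


# Hypothetical usage with torchtext installed (paths are assumptions):
#   datasets = import_first("torchtext.legacy.datasets", "torchtext.datasets")
#   SNLI = datasets.SNLI
```

Trying the legacy path first matters because some names, such as IMDB, appear both in the legacy list above and among the rewritten datasets, so the original path may later resolve to a different implementation.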

Step 2: Release the new datasets

Several of the legacy datasets above have been rewritten and are currently available in torchtext.experimental.datasets. They will be released to the core library:

  • language_modeling - LanguageModelingDataset, WikiText2, WikiText103, PennTreebank, WMTNewsCrawl
  • text_classification - AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
  • sequence_tagging - UDPOS, CoNLL2000Chunking
  • translation - Multi30k, IWSLT, WMT14
  • question_answer - SQuAD1, SQuAD2

Step 3: Retire legacy vocab/vector and release the new data processing building blocks

We have also rewritten the vocabulary and word vectors as high-performance building blocks with JIT support. We will retire the following components:

  • torchtext.vocab.Vocab
  • torchtext.vocab.Vectors, along with GloVe, FastText, and CharNGram

After this, the new vocabulary and vector building blocks in the experimental folder will be moved to the core library.

  • torchtext.experimental.vectors
  • torchtext.experimental.vocab

We also have some transforms that will be released to the core library.

  • torchtext.experimental.transforms

We understand this is an unusual period for the torchtext library, since we have to maintain the legacy code and the new building blocks at the same time. We really appreciate the efforts of the OSS community. Users should approach the code in each of the three categories with the following expectations:

  • legacy folder - we will accept bug fixes but not new features.
  • torchtext main folder - officially supported through the stable releases, with backward-compatibility (BC) breaking changes handled carefully.
  • experimental folder - experimental components available through the nightly release channel. Users might experience BC-breaking changes without warning messages.
