This repository was archived by the owner on Sep 10, 2025. It is now read-only.

[HELP WANTED] Re-write datasets in torchtext #742

@zhangguanheng66

As mentioned in #664, we are working on a new dataset abstraction. The new datasets will be more compatible with the PyTorch core library and will work with out-of-the-box libraries (such as SentencePiece BPE).

We have landed several datasets in the torchtext.experimental.datasets folder to test the new abstraction. Now we would like some help from the open-source community. Please sign up here and contribute PRs to re-write those datasets in torchtext. The datasets in torchtext/experimental/datasets/text_classification should be good examples to follow. Candidate datasets include, but are not limited to:

  • Word language modeling datasets. These are already in the experimental folder but need to be updated to the latest abstraction.
  • IMDB
  • Text classification datasets (AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull)
  • Translation
  • Question-answer datasets (e.g. SQuAD here)
  • EnWik9, which is already in good shape
  • Sequence tagging datasets (see "Experimental sequence tagging datasets", #805)

Here is a checklist that you may follow when migrating a dataset:

  1. Set up an IterableDataset with an iterator to read the raw data link
  2. Cache the raw dataset and create a transform pipeline link
    • pick up a tokenizer
    • generate the vocab object if not provided by users
    • attach tokenizer + vocab + totensor to the transform pipeline
  3. Add unit tests link
    • check the total length of the dataset
    • check the content of the dataset

Feel free to ping me and @cpuhrsch if you want to discuss any ideas for this. Thanks in advance.
