
Add T5 Model and Demo on Text Summarization using CNNDM Dataset #1800

@pmabbo13

Description

🚀 Feature

Add the CNNDM dataset and a pre-trained T5 model to TorchText. Demo the model on the task of abstractive summarization using the CNNDM dataset.

Motivation

There are multiple frameworks out in OSS that cater to a wide variety of audiences. As a result of this fragmentation, a typical NLP researcher usually writes their code in pure PyTorch while copying essential components from other repositories. Adding a pre-trained T5 model and CNNDM dataset increases the convenience of using the TorchText library and works towards making PyTorch the most preferred deep learning framework for NLP research.

T5 (Text-To-Text Transfer Transformer) is a transformer model trained end-to-end with text as input and modified text as output. This text-to-text formulation makes T5 well suited to many NLP tasks, such as summarization, question answering, machine translation, and classification. CNNDM (CNN/DailyMail) is also a popular dataset in the NLP community for text summarization tasks.
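Concretely, every task is expressed as plain text in and plain text out, usually signalled by a task prefix on the input. The snippet below is purely illustrative of that format; the prefixes and sentences are generic T5-style examples, not part of this proposal:

```python
# Illustrative only: T5's text-to-text format, where every task is cast as
# "prefixed input string -> output string".
examples = [
    # Abstractive summarization (CNNDM-style)
    ("summarize: The tower is 324 metres tall, about the same height as an 81-storey building ...",
     "The tower is about as tall as an 81-storey building."),
    # Machine translation
    ("translate English to German: That is good.", "Das ist gut."),
    # Classification, with the label emitted as text
    ("sst2 sentence: it confirms fincher's status as a film maker.", "positive"),
]

for source, target in examples:
    print(f"input : {source}\ntarget: {target}\n")
```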

Pitch

The T5 model architecture will be implemented so that it can be initialized from hyper-parameters such as the number of layers, hidden size, attention size, etc. The user should also be able to specify whether they want the Encoder-only model (for non-text-generation tasks) or the full Encoder-Decoder model. For pre-trained weights, Google has released checkpoints for the five different-sized T5 models; these checkpoint weights will be hosted on PyTorch.org and an API will be implemented to load them. Finally, integration tests will be added for both the Encoder-only and Encoder-Decoder model APIs.
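As a rough sketch of what such an initialization and checkpoint-loading API might look like (all names, defaults, and the checkpoint placeholder below are hypothetical, not the final TorchText API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class T5Conf:
    """Hypothetical configuration object; fields mirror common T5 hyper-parameters."""
    encoder_only: bool = False       # True -> Encoder-only model for non-generation tasks
    embedding_dim: int = 768         # hidden size (T5-Base values used as defaults)
    num_attention_heads: int = 12
    num_encoder_layers: int = 12
    num_decoder_layers: int = 12
    ffn_dimension: int = 3072
    dropout: float = 0.1
    vocab_size: int = 32128

def build_t5_model(config: T5Conf, checkpoint: Optional[str] = None):
    """Construct a T5 model from `config`, optionally loading pre-trained weights
    from a checkpoint hosted on PyTorch.org (hypothetical signature)."""
    ...

# Example usage: an encoder-decoder T5-Base, optionally initialized from a released checkpoint.
conf = T5Conf(encoder_only=False)
# model = build_t5_model(conf, checkpoint="<url-to-t5-base-weights>")
```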

The CNNDM dataset will also be made available in the TorchText library. This will allow us to demo the pre-trained T5 model by using it to perform abstractive summarization on the CNNDM dataset. A text pre-processing pipeline will need to be implemented in order to prep the data for the model.
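A minimal sketch of that pre-processing step, assuming the proposed dataset yields (article, abstract) string pairs and that T5's SentencePiece tokenizer is available; the function and parameter names here are assumptions for illustration:

```python
from typing import Iterable, List, Tuple

TASK_PREFIX = "summarize: "   # T5 expects a task prefix on the input text
MAX_SOURCE_LEN = 512
MAX_TARGET_LEN = 128

def preprocess(pairs: Iterable[Tuple[str, str]], tokenize) -> List[Tuple[List[int], List[int]]]:
    """Prepend the task prefix, tokenize, and truncate source/target sequences."""
    batch = []
    for article, abstract in pairs:
        src_ids = tokenize(TASK_PREFIX + article)[:MAX_SOURCE_LEN]
        tgt_ids = tokenize(abstract)[:MAX_TARGET_LEN]
        batch.append((src_ids, tgt_ids))
    return batch

# Usage with a stand-in tokenizer; a real run would use T5's SentencePiece model.
dummy_tokenize = lambda text: list(range(len(text.split())))
processed = preprocess([("Some news article ...", "A short abstract.")], dummy_tokenize)
```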

Milestone 1: Add CNNDM dataset

Milestone 2: Implement T5 model architecture

Milestone 3: Demo T5 model on text summarization
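For Milestone 3, the demo ultimately needs to decode summaries from the encoder-decoder model. A minimal greedy-decoding sketch is shown below; the model call signature and special-token ids are placeholders rather than the final API:

```python
import torch

EOS_IDX, PAD_IDX = 1, 0      # placeholder special-token ids
MAX_SUMMARY_LEN = 128

@torch.no_grad()
def greedy_summarize(model, src_ids: torch.Tensor) -> torch.Tensor:
    """src_ids: (1, src_len) token ids for 'summarize: <article>'."""
    # T5 starts decoding from the pad token.
    decoder_ids = torch.full((1, 1), PAD_IDX, dtype=torch.long)
    for _ in range(MAX_SUMMARY_LEN):
        # Placeholder call signature; returns (1, t, vocab_size) logits.
        logits = model(encoder_tokens=src_ids, decoder_tokens=decoder_ids)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        decoder_ids = torch.cat([decoder_ids, next_token], dim=1)
        if next_token.item() == EOS_IDX:
            break
    return decoder_ids
```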

Stretch Goals
