
Conversation

@pmabbo13 (Contributor) commented on Jul 12, 2022

Description

Add T5 architecture to torchtext

Process

The T5 architecture is very similar to that of a traditional transformer. The main differences are that, rather than using positional embeddings, T5 computes a relative attention bias that encodes the relative position of each token within a sequence. This position bias is passed into each layer and used to compute the attention scores. T5 also uses a simplified layer normalization (root-mean-square normalization), which is applied at the start of every attention and feed-forward block.
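For illustration, a minimal sketch of the root-mean-square layer normalization described above (the class name and epsilon default are assumptions, not the exact torchtext implementation):

```python
import torch
import torch.nn as nn


class T5LayerNorm(nn.Module):
    """Simplified (RMS) layer norm: no mean subtraction and no bias term."""

    def __init__(self, d_model: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the hidden dimension, then rescale.
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)
```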

Incorporating relative attention bias requires under-the-hood changes to the MultiheadAttention module. We can use HF's implementation for computing the relative attention bias and modify the source code of torch.nn.MultiheadAttention to incorporate it. We can also create our own layer normalization, similar to HF's.
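As a rough sketch of how the relative attention bias enters the attention computation (shapes follow the usual T5 convention; this is illustrative, not the modified MultiheadAttention source itself):

```python
import torch


def t5_attention(q, k, v, position_bias, mask=None):
    """Dot-product attention with an additive relative position bias.

    q, k, v:        (batch, n_heads, seq_len, head_dim)
    position_bias:  (1, n_heads, q_len, k_len), broadcast over the batch
    mask:           optional boolean mask, True marks positions to ignore
    """
    scores = torch.matmul(q, k.transpose(-1, -2))  # T5 omits the 1/sqrt(head_dim) scaling
    scores = scores + position_bias                # bias is added before the softmax
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
```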

Given the above components, we can then define our own T5Layer, T5Stack, and T5Model:

  • T5Layer can be used as either an encoder layer or a decoder layer based on an input boolean parameter. The only difference between the two is that the decoder layer also performs cross-attention with the encoder output.
  • T5Stack can likewise be used as either an encoder or a decoder based on an input boolean parameter, which dictates which type of layer composes the stack.
  • T5Model can be used as either an encoder-only or an encoder-decoder model based on an input boolean parameter. If it is an encoder-decoder model, a causal mask is generated for the decoder input tokens (see the sketch after this list).
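
A minimal sketch of the causal mask generated for the decoder input tokens (the helper name is illustrative):

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where True marks future positions the decoder must not attend to."""
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()


# For a 4-token decoder input, position i may only attend to positions <= i.
print(causal_mask(4))
```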

Testing

To test our implementation of the T5 model, we compared its outputs to those of HuggingFace's T5 encoder-only and T5 encoder-decoder implementations. Testing was done in this notebook; we will update this PR once formal unit and integration tests have been added.
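For reference, the kind of parity check such a notebook typically performs might look like the following (a hedged sketch; the model name, tolerance, and `our_t5_encoder` handle are illustrative and not part of this PR):

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
hf_encoder = T5EncoderModel.from_pretrained("t5-small").eval()

batch = tokenizer(["translate English to German: Hello world"], return_tensors="pt")
with torch.no_grad():
    hf_out = hf_encoder(**batch).last_hidden_state
    # our_out = our_t5_encoder(batch["input_ids"])   # placeholder for the torchtext T5 encoder
    # assert torch.allclose(hf_out, our_out, atol=1e-4)
```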

Stack

Stack from ghstack (oldest at bottom):

WIP PR where implementation details were discussed: #1812

@Nayef211 (Contributor) left a comment:

Overall LGTM

@parmeet (Contributor) left a comment:

LGTM!

@Nayef211 (Contributor) commented:

Let's rebase the PR stack on the latest main. A lot of the test failures you are seeing should have been resolved by PRs that have already been merged.

@pmabbo13 merged commit 20180ff into gh/pmabbo13/5/base on Jul 18, 2022
@pmabbo13 added a commit that referenced this pull request on Jul 18, 2022
@facebook-github-bot deleted the gh/pmabbo13/5/head branch on August 18, 2022