This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@pmabbo13
Contributor

The original PR can be found here: #1825

pmabbo13 and others added 9 commits July 12, 2022 17:35
# Description

Add T5 architecture to torchtext

# Process

The T5 architecture closely follows that of a traditional transformer. The main differences are that, rather than using positional embeddings, T5 computes a relative attention bias that encodes the relative position of each token within a sequence; this bias is passed into every layer and added when computing the attention scores. T5 also uses a simplified layer normalization (root-mean-square normalization), which is applied at the start of every attention and feed-forward block.
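The simplified layer normalization mentioned above can be sketched as follows. This is a minimal illustration (class and parameter names are placeholders, not the final torchtext API): unlike `nn.LayerNorm`, it subtracts no mean and has no bias term, only a learned scale.

```python
import torch
import torch.nn as nn


class T5LayerNorm(nn.Module):
    """Root-mean-square layer norm: scale by 1/RMS of the last dim.

    Sketch only; no mean subtraction and no bias, unlike nn.LayerNorm.
    """

    def __init__(self, d_model: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mean of squares over the feature dimension, then rsqrt.
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)


x = torch.randn(2, 4, 8)
y = T5LayerNorm(8)(x)
```

With the scale initialized to ones, the output has (approximately) unit mean square along the feature dimension.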

Incorporating relative attention bias requires under-the-hood changes to the multi-head attention module. We can use HF's implementation for computing relative attention bias and modify the source code of torch.nn.MultiheadAttention to incorporate the bias. We can also create our own layer normalization, similar to HF's.
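The bucketing scheme HF uses for the relative attention bias can be sketched as below: signed token distances are mapped to a fixed number of bucket ids (exact for small distances, logarithmically coarser for large ones), and those ids then index a learned per-head embedding that is added to the attention scores. This is an adapted sketch, not the torchtext code under review.

```python
import math

import torch


def relative_position_bucket(relative_position: torch.Tensor,
                             bidirectional: bool = True,
                             num_buckets: int = 32,
                             max_distance: int = 128) -> torch.Tensor:
    """Map signed distances (memory_pos - context_pos) to bucket ids.

    Adapted from HF's T5 implementation; a sketch, not the final API.
    """
    relative_buckets = torch.zeros_like(relative_position)
    if bidirectional:
        # Half the buckets for each direction.
        num_buckets //= 2
        relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
        relative_position = torch.abs(relative_position)
    else:
        # Causal case: only non-positive distances are reachable.
        relative_position = -torch.min(relative_position,
                                       torch.zeros_like(relative_position))
    # Small distances get one bucket each; larger ones share log-spaced buckets.
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).to(torch.long)
    if_large = torch.min(if_large, torch.full_like(if_large, num_buckets - 1))
    relative_buckets += torch.where(is_small, relative_position, if_large)
    return relative_buckets


# Distances between every (query, key) position pair of a length-4 sequence.
context = torch.arange(4)[:, None]
memory = torch.arange(4)[None, :]
buckets = relative_position_bucket(memory - context)
```

The resulting `(q_len, k_len)` bucket ids would then be fed through an `nn.Embedding(num_buckets, n_heads)` and broadcast into the per-head attention scores.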

Given the above components, we can then define our own T5Layer, T5Stack, and T5Model.
* The T5Layer can be used either as an encoder layer or a decoder layer based on an input boolean parameter. The only difference between the decoder layer and the encoder layer is that the decoder layer also performs cross-attention over the encoder output.
* T5Stack can also be used as either an encoder or decoder based on an input boolean parameter. This dictates which type of layer composes the stack.
* T5Model can be used either as an encoder-only or encoder-decoder model based on an input boolean parameter. If it is an encoder-decoder model, a causal mask is generated for the decoder input tokens.
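The causal mask generated for the decoder input tokens (last bullet above) can be sketched as follows. This uses the additive `-inf` convention that `nn.MultiheadAttention` expects for its `attn_mask`; the helper name is illustrative.

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Additive causal mask: position i may not attend to positions j > i.

    Illustrative sketch; -inf entries zero out the corresponding
    attention weights after softmax.
    """
    return torch.triu(
        torch.full((seq_len, seq_len), float("-inf")), diagonal=1
    )


m = causal_mask(4)
```

The same mask can also be produced by `torch.nn.Transformer.generate_square_subsequent_mask`; it is applied only on the decoder's self-attention, never on cross-attention.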

# Testing
Not yet implemented.

# Stack

WIP PR where implementation details were discussed: #1812 

[ghstack-poisoned]
@pmabbo13 pmabbo13 requested a review from Nayef211 July 18, 2022 21:53
@pmabbo13 pmabbo13 closed this Jul 18, 2022
@facebook-github-bot facebook-github-bot deleted the gh/pmabbo13/5/base branch August 18, 2022 14:20