create T5MultiheadAttention module #1825

pmabbo13 · 2022-07-12T21:35:41Z

Description

Add T5 architecture to torchtext

Process

The T5 architecture closely follows that of a traditional transformer. The main difference is that, rather than using positional embeddings, it computes a relative attention bias that encodes the relative positions of tokens within a sequence. This position bias is passed into each layer and used when computing the attention scores. T5 also uses a simplified layer normalization (root mean square normalization), which is applied at the start of every attention and feed-forward block.
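
As a rough illustration of the simplified layer normalization mentioned above, here is a minimal pure-Python sketch for a single hidden vector. The function name `rms_norm`, the `eps` default, and the use of plain lists are assumptions made for illustration; this is not the code added in this PR.

```python
import math


def rms_norm(hidden, weight, eps=1e-6):
    """Root mean square normalization of one hidden vector (illustrative sketch).

    Unlike standard layer normalization, no mean is subtracted and no bias is
    added: the vector is only rescaled by its root mean square and then
    multiplied elementwise by a learned weight.
    """
    mean_square = sum(x * x for x in hidden) / len(hidden)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * x * scale for x, w in zip(hidden, weight)]
```

Because the mean is not subtracted, a constant vector such as `[2.0, 2.0]` is mapped to roughly `[1.0, 1.0]` with unit weights, where standard layer normalization would map it to zeros.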

Incorporating relative attention bias requires under-the-hood changes to the multi-head attention module. We can use HF's implementation for computing the relative attention bias and modify the source code of torch.nn.MultiheadAttention to incorporate it. We can also create our own layer normalization, similar to HF's.
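
The two steps described above can be sketched in plain Python for a single attention head. The bucketing function is a scalar version modeled on the commonly used T5 bucketing scheme. The names `relative_position_bucket`, `biased_attention_weights`, and `bias_per_bucket`, as well as the default bucket count and maximum distance, are assumptions for illustration and may differ from the module in this PR.

```python
import math


def relative_position_bucket(relative_position, bidirectional=True,
                             num_buckets=32, max_distance=128):
    """Map a signed offset (key position - query position) to a bucket index."""
    bucket = 0
    if bidirectional:
        # Half of the buckets cover keys after the query, half cover keys before it.
        num_buckets //= 2
        if relative_position > 0:
            bucket += num_buckets
        distance = abs(relative_position)
    else:
        # Unidirectional case: keys after the query all collapse to distance 0.
        distance = max(-relative_position, 0)

    max_exact = num_buckets // 2
    if distance < max_exact:
        # Small distances each get their own bucket.
        return bucket + distance
    # Larger distances share logarithmically sized buckets, capped at the last one.
    log_bucket = max_exact + int(
        math.log(distance / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return bucket + min(log_bucket, num_buckets - 1)


def biased_attention_weights(scores, bias_per_bucket, bidirectional=True):
    """Add a per-bucket bias to raw scores, then softmax each query row.

    scores[q][k] is the raw score of query q against key k, and
    bias_per_bucket stands in for one head's learned bias table.
    """
    weights = []
    for q, row in enumerate(scores):
        biased = [
            s + bias_per_bucket[relative_position_bucket(k - q, bidirectional)]
            for k, s in enumerate(row)
        ]
        peak = max(biased)  # subtract the row maximum for numerical stability
        exps = [math.exp(b - peak) for b in biased]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

The bias depends only on the offset between key and query positions, so the same table can be reused for sequences of any length.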

Given the above components, we can then define our own T5Layer, T5Stack, and T5Model.

* T5Layer can be used as either an encoder layer or a decoder layer, based on an input boolean parameter. The only difference between the two is that the decoder layer also performs cross-attention with the encoder output.
* T5Stack can also be used as either an encoder or a decoder, based on an input boolean parameter. This dictates which type of layer composes the stack.
* T5Model can be used as either an encoder-only or an encoder-decoder model, based on an input boolean parameter. If it is an encoder-decoder model, a causal mask is generated for the decoder input tokens.
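
For reference, a causal mask for the decoder input tokens can be sketched as below. The helper name `causal_mask` and the convention that `True` marks a position that must not be attended to are assumptions for illustration; the model in this stack may build or represent its mask differently.

```python
def causal_mask(seq_len):
    """Build a square boolean mask for decoder self-attention.

    Row i is the query position and column j is the key position.
    True marks future positions (j > i) that the query may not attend to.
    """
    return [[j > i for j in range(seq_len)] for i in range(seq_len)]
```

For example, `causal_mask(3)` hides two positions from the first token, one from the second, and none from the third, so each token only attends to itself and to earlier tokens.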

Testing

To test our implementation of the T5 model, we compared our outputs to the outputs of HuggingFace's T5 encoder-only and T5 encoder-decoder implementations. Testing was done in this notebook. We will update this PR once formal unit and integration tests have been added.
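
A comparison of this kind usually checks that two outputs agree within a numerical tolerance rather than exactly. The sketch below shows the idea on nested Python lists. The helper name `outputs_match` and the tolerance values are hypothetical and are not taken from the notebook.

```python
import math


def outputs_match(ours, reference, rel_tol=1e-5, abs_tol=1e-6):
    """Return True if two nested lists of floats have the same shape and
    agree elementwise within the given tolerances."""
    ours_nested = isinstance(ours, (list, tuple))
    ref_nested = isinstance(reference, (list, tuple))
    if ours_nested != ref_nested:
        return False  # one side is a scalar and the other is a sequence
    if ours_nested:
        return len(ours) == len(reference) and all(
            outputs_match(a, b, rel_tol, abs_tol)
            for a, b in zip(ours, reference)
        )
    return math.isclose(ours, reference, rel_tol=rel_tol, abs_tol=abs_tol)
```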

Stack

Stack from ghstack (oldest at bottom):

WIP PR where implementation details were discussed: #1812

[ghstack-poisoned]

Nayef211

Overall LGTM

torchtext/prototype/t5/modules.py

parmeet

LGTM!

Nayef211 · 2022-07-18T14:53:44Z

Let's rebase the PR stack on the latest main. A lot of the test failures you are seeing should have been resolved by PRs that have already been merged.

This reverts commit 20180ff.

create T5MultiheadAttention module

13f9c22

[ghstack-poisoned]

facebook-github-bot added the cla signed label Jul 12, 2022

This was referenced Jul 12, 2022

add t5 stack that can function as either the encoder or decoder of a t5 model #1828

Merged

add t5 model that can function as both encodery-only or encoder-decoder model #1829

Merged

Update on "create T5MultiheadAttention module"

9d077bd

[ghstack-poisoned]

This was referenced Jul 13, 2022

compute relative position buckets for relative attention bias #1830

Merged

computing relative attention bias #1831

Merged

computing attention scores using relative attention bias #1832

Merged

adding forward method for multihead attention #1833

Merged

Update on "create T5MultiheadAttention module"

32ad86c

[ghstack-poisoned]

pmabbo13 requested review from Nayef211, abhinavarora and parmeet July 13, 2022 18:13

Nayef211 approved these changes Jul 13, 2022

pmabbo13 mentioned this pull request Jul 14, 2022

Add T5 Model and Demo on Text Summarization using CNNDM Dataset #1800

Closed

25 tasks

parmeet approved these changes Jul 15, 2022

pmabbo13 merged commit 20180ff into gh/pmabbo13/5/base Jul 18, 2022

pmabbo13 added a commit that referenced this pull request Jul 18, 2022

Revert "create T5MultiheadAttention module (#1825)"

d9d54cc

This reverts commit 20180ff.

This was referenced Jul 18, 2022

Create T5MultiheadAttention module #1842

Closed

Add T5 Model to TorchText #1845

Merged

facebook-github-bot deleted the gh/pmabbo13/5/head branch August 18, 2022 14:20
