This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@pmabbo13
Contributor

The original PR can be found here: #1825

pmabbo13 and others added 9 commits July 12, 2022 17:35
# Description

Add T5 architecture to torchtext

# Process

The T5 architecture closely follows that of a traditional transformer. The main differences are that, rather than using positional embeddings, T5 computes a relative attention bias that encodes the relative position of each token within a sequence; this bias is passed into every layer and added when computing the attention scores. T5 also uses a simplified layer normalization (root-mean-square normalization), which is applied at the start of every attention and feed-forward block.
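The simplified layer normalization mentioned above can be sketched as follows. This is a minimal illustration (class and parameter names are placeholders, not the final torchtext API): unlike `nn.LayerNorm`, it subtracts no mean and has no bias term, only a learned scale.

```python
import torch
import torch.nn as nn


class T5LayerNorm(nn.Module):
    """Root-mean-square layer norm: scale by 1/RMS of the last dim.

    Sketch only; no mean subtraction and no bias, unlike nn.LayerNorm.
    """

    def __init__(self, d_model: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mean of squares over the feature dimension, then rsqrt.
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)


x = torch.randn(2, 4, 8)
y = T5LayerNorm(8)(x)
```

With the scale initialized to ones, the output has (approximately) unit mean square along the feature dimension.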

Incorporating relative attention bias requires under-the-hood changes to the multi-head attention module. We can use HF's implementation for computing relative attention bias and modify the source code of torch.nn.MultiheadAttention to incorporate the bias. We can also create our own layer normalization, similar to HF's.
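The bucketing scheme HF uses for the relative attention bias can be sketched as below: signed token distances are mapped to a fixed number of bucket ids (exact for small distances, logarithmically coarser for large ones), and those ids then index a learned per-head embedding that is added to the attention scores. This is an adapted sketch, not the torchtext code under review.

```python
import math

import torch


def relative_position_bucket(relative_position: torch.Tensor,
                             bidirectional: bool = True,
                             num_buckets: int = 32,
                             max_distance: int = 128) -> torch.Tensor:
    """Map signed distances (memory_pos - context_pos) to bucket ids.

    Adapted from HF's T5 implementation; a sketch, not the final API.
    """
    relative_buckets = torch.zeros_like(relative_position)
    if bidirectional:
        # Half the buckets for each direction.
        num_buckets //= 2
        relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
        relative_position = torch.abs(relative_position)
    else:
        # Causal case: only non-positive distances are reachable.
        relative_position = -torch.min(relative_position,
                                       torch.zeros_like(relative_position))
    # Small distances get one bucket each; larger ones share log-spaced buckets.
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).to(torch.long)
    if_large = torch.min(if_large, torch.full_like(if_large, num_buckets - 1))
    relative_buckets += torch.where(is_small, relative_position, if_large)
    return relative_buckets


# Distances between every (query, key) position pair of a length-4 sequence.
context = torch.arange(4)[:, None]
memory = torch.arange(4)[None, :]
buckets = relative_position_bucket(memory - context)
```

The resulting `(q_len, k_len)` bucket ids would then be fed through an `nn.Embedding(num_buckets, n_heads)` and broadcast into the per-head attention scores.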

Given the above components, we can then define our own T5Layer, T5Stack, and T5Model.
* The T5Layer can be used either as an encoder layer or a decoder layer based on an input boolean parameter. The only difference between the decoder layer and the encoder layer is that the decoder layer also performs cross-attention over the encoder output.
* T5Stack can also be used as either an encoder or decoder based on an input boolean parameter. This dictates which type of layer composes the stack.
* T5Model can be used either as an encoder-only or encoder-decoder model based on an input boolean parameter. If it is an encoder-decoder model, a causal mask is generated for the decoder input tokens.
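The causal mask generated for the decoder input tokens (last bullet above) can be sketched as follows. This uses the additive `-inf` convention that `nn.MultiheadAttention` expects for its `attn_mask`; the helper name is illustrative.

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Additive causal mask: position i may not attend to positions j > i.

    Illustrative sketch; -inf entries zero out the corresponding
    attention weights after softmax.
    """
    return torch.triu(
        torch.full((seq_len, seq_len), float("-inf")), diagonal=1
    )


m = causal_mask(4)
```

The same mask can also be produced by `torch.nn.Transformer.generate_square_subsequent_mask`; it is applied only on the decoder's self-attention, never on cross-attention.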

# Testing
Not yet implemented.

# Stack

WIP PR where implementation details were discussed: #1812 

[ghstack-poisoned]
@pmabbo13 pmabbo13 requested a review from Nayef211 July 18, 2022 21:53
@pmabbo13 pmabbo13 closed this Jul 18, 2022
@facebook-github-bot facebook-github-bot deleted the gh/pmabbo13/5/base branch August 18, 2022 14:20