[WIP] Add T5 Model to TorchText #1812
Conversation
relative_buckets += torch.where(is_small, relative_position, relative_position_if_large)
return relative_buckets
_compute_bias() is the same as the HF implementation, except that we explicitly pass in the parameters relative_attention_bias, relative_attention_num_buckets, relative_attention_max_distance, and bidirectional.
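For reference, a minimal sketch of what a standalone _compute_bias() with explicit parameters might look like, following the HF logic; it relies on the _relative_position_bucket() helper discussed below, and the exact signature and argument order in the PR may differ:

```python
import torch
import torch.nn as nn
from torch import Tensor

def _compute_bias(
    query_length: int,
    key_length: int,
    relative_attention_bias: nn.Embedding,
    relative_attention_num_buckets: int = 32,
    relative_attention_max_distance: int = 128,
    bidirectional: bool = True,
    device=None,
) -> Tensor:
    """Compute binned relative position bias, shape (1, num_heads, query_length, key_length)."""
    context_position = torch.arange(query_length, dtype=torch.long, device=device)[:, None]
    memory_position = torch.arange(key_length, dtype=torch.long, device=device)[None, :]
    relative_position = memory_position - context_position  # (query_length, key_length)
    relative_position_bucket = _relative_position_bucket(
        relative_position,
        bidirectional=bidirectional,
        num_buckets=relative_attention_num_buckets,
        max_distance=relative_attention_max_distance,
    )
    values = relative_attention_bias(relative_position_bucket)  # (query_length, key_length, num_heads)
    values = values.permute([2, 0, 1]).unsqueeze(0)  # (1, num_heads, query_length, key_length)
    return values
```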
import torch
_relative_position_bucket() is identical to the HF implementation.
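For context, a sketch of that bucketing logic as it appears in HF's T5 code (reproduced from memory, so treat the details as approximate):

```python
import math
import torch
from torch import Tensor

def _relative_position_bucket(
    relative_position: Tensor, bidirectional: bool = True, num_buckets: int = 32, max_distance: int = 128
) -> Tensor:
    relative_buckets = torch.zeros_like(relative_position)
    if bidirectional:
        num_buckets //= 2
        relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
        relative_position = torch.abs(relative_position)
    else:
        relative_position = -torch.min(relative_position, torch.zeros_like(relative_position))
    # half of the buckets cover exact increments; the rest grow logarithmically up to max_distance
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    relative_position_if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).to(torch.long)
    relative_position_if_large = torch.min(
        relative_position_if_large, torch.full_like(relative_position_if_large, num_buckets - 1)
    )
    relative_buckets += torch.where(is_small, relative_position, relative_position_if_large)
    return relative_buckets
```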
@parmeet @mthrok do you know the guidance for citing code that is taken from the HF library? It looks like they have an Apache 2.0 license.
The general guideline is to retain license headers and license description.
But one should consult https://fburl.com/y41qjtzl to be sure.
I saw this as one way of citing sources in PyTorch core. Maybe we could do something similar.
I consulted internally, since I have to do something similar for the BERT tokenizer implementation. In summary, we need to include a modified license header in the source code. You can refer to the header in the BERT tokenizer CPP file.
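Purely as an illustration, a modified header might read something like the following; the exact wording and copyright line should follow the internal guidance and the BERT tokenizer header rather than this sketch:

```python
# This file contains code adapted from Hugging Face Transformers
# (https://github.com/huggingface/transformers), which is licensed under
# the Apache License, Version 2.0. Modifications are noted inline.
```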
values = values.permute([2, 0, 1]).unsqueeze(0)  # shape (1, num_heads, query_length, key_length)
return values
_t5_scaled_dot_product_attention() is modified from the PyTorch implementation to incorporate the relative attention bias when computing the attention scores. position_bias is passed in as a parameter (line 90) and is added to the attention score computation in lines 119-124, 129, and 133.
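In essence, the change amounts to adding position_bias to the raw attention scores before the softmax. A simplified, self-contained sketch rather than the exact PR code (note that T5 conventionally omits the 1/sqrt(d) scaling):

```python
import torch
import torch.nn.functional as F
from torch import Tensor

def t5_attention_sketch(q: Tensor, k: Tensor, v: Tensor, position_bias: Tensor, dropout_p: float = 0.0):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    # position_bias: (1, num_heads, seq_len, seq_len), broadcast across the batch
    scores = torch.matmul(q, k.transpose(-2, -1))  # no 1/sqrt(d) scaling in T5
    scores = scores + position_bias                # the key change vs. the stock implementation
    attn = F.softmax(scores, dim=-1)
    attn = F.dropout(attn, p=dropout_p)
    return torch.matmul(attn, v), attn
```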
else:
    return attn_output, attn_output_weights, position_bias
The T5 model uses a root-mean-square layer normalization. T5LayerNorm implements this and is taken directly from HF.
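A minimal sketch of such an RMS-style layer norm, following HF's T5LayerNorm (no mean subtraction and no bias term; dtype handling is omitted):

```python
import torch
import torch.nn as nn
from torch import Tensor

class T5LayerNormSketch(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.variance_epsilon = eps

    def forward(self, hidden_states: Tensor) -> Tensor:
        # root-mean-square normalization: scale by 1/sqrt(mean(x^2)), no centering
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states
```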
output = torch.bmm(attn, v)
return output, attn
t5_multi_head_attention_forward() is modified from its PyTorch implementation to incorporate relative attention bias. We've added the parameters compute_relative_attention_bias, relative_attention_bias, relative_attention_num_buckets, relative_attention_max_distance, and position_bias, which are used in lines 432-444 and 449. This implementation was inspired by a similar HF implementation.
Is it absolutely necessary for this implementation to be a plain function, instead of a method on the T5 class?
I see that many attributes from the model instance are being passed in, which makes me wonder what the value-add of the T5 model class implementation is.
PyTorch's implementation (which originated from torchtext) tries to cover many use cases, so its signature is complex, whereas this implementation only needs to serve the T5 model; can't it be simplified?
return attn_output, None, position_bias
T5MultiheadAttention inherits from PyTorch's MultiheadAttention. The forward method is a trimmed-down version of theirs that includes the parameters necessary to incorporate relative attention bias. PyTorch's implementation includes some torchscript optimizations (lines 1069-1126) that I wasn't too sure about, so I omitted them for now.
return self.weight * hidden_states
T5EncoderLayer inherits from PyTorch's TransformerEncoderLayer. It is initialized to incorporate the relative attention bias parameters, use T5MultiheadAttention and T5LayerNorm, and remove the bias from the linear layers (lines 627-636). The forward method is also taken from the PyTorch implementation and modified to incorporate the position bias (lines 656-694). The PyTorch method included some torchscript optimizations (lines 410-457) that I omitted for now.
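Schematically, the modified forward becomes a pre-norm residual block that threads position_bias through self-attention and returns it with the output. A hedged, illustrative sketch (the block callables stand in for the layer's actual sub-modules; the real signature in the PR may differ):

```python
def t5_encoder_layer_forward(x, self_attn_block, ff_block, norm1, norm2, position_bias=None):
    # pre-norm self-attention: the attention block consumes and returns the position bias
    sa_out, position_bias = self_attn_block(norm1(x), position_bias)
    x = x + sa_out
    # pre-norm feed-forward block
    x = x + ff_block(norm2(x))
    return x, position_bias
```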
x = self.linear2(self.dropout(self.activation(self.linear1(x))))
return self.dropout2(x)
T5Encoder inherits from PyTorch's TransformerEncoder. The initialization is mostly the same, except that we only compute the relative attention bias in the first layer; the resulting tensor is then passed up to the higher layers to avoid re-computing it. Lines 732-735 incorporate this distinction. The forward method is a trimmed-down version of the PyTorch implementation's, which included torchscript optimizations (lines 206-232) that I've omitted for now. They've also implemented something called convert_to_nested whose purpose I'm not too sure of.
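The bias reuse boils down to a loop along these lines (a sketch only; the layer call signature is an assumption):

```python
def run_t5_encoder_stack(layers, x, mask=None):
    # only the first layer computes the relative attention bias; each later layer
    # reuses the tensor returned by the previous call instead of recomputing it
    position_bias = None
    for layer in layers:
        x, position_bias = layer(x, mask=mask, position_bias=position_bias)
    return x
```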
return output
T5EncoderModel is the complete encoder model, which includes the word embeddings and the final layer normalization and dropout.
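End to end, the forward pass might look roughly like this sketch (the component names token_embeddings, encoder, final_norm, dropout and the src_key_padding_mask keyword are illustrative assumptions, not the PR's exact API):

```python
def t5_encoder_model_forward(tokens, token_embeddings, encoder, final_norm, dropout, padding_idx=0):
    # tokens: (batch, seq_len) integer token ids
    padding_mask = tokens.eq(padding_idx)              # mask out pad positions in attention
    x = dropout(token_embeddings(tokens))              # word embeddings + dropout
    x = encoder(x, src_key_padding_mask=padding_mask)  # stack of T5 encoder layers
    return dropout(final_norm(x))                      # final RMS layer norm + dropout
```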
Overall the implementation looks fine to me in terms of class separation and design. To verify correctness, let's compare the outputs of the HF implementation with ours as we discussed. We can take a closer look at the components once we break up this PR into several smaller ones after we've verified correctness of the model!
@abhinavarora it would be useful to get your feedback on whether all of the changes from the T5 paper are represented correctly in the code.
def __init__(
    self,
    embed_dim,
    num_heads,
    dropout=0.0,
    bias=False,
    add_bias_kv=False,
    add_zero_attn=False,
    kdim=None,
    vdim=None,
    batch_first=False,
    device=None,
    dtype=None,
When inheriting from a parent class, I would make use of *args and **kwargs to pass arguments and keyword arguments through to the parent class. In the T5MultiheadAttention class, you don't make use of any of the input args during initialization, so using *args and **kwargs and passing them to the parent class for initialization would be less verbose.
@mthrok I wanted to double-check whether you follow this practice in torchaudio?
I think it depends on the context. There are pros and cons to both approaches. If the signature of the constructor is a public API, then explicitly listing the arguments can be helpful for documentation (at the same time, it can appear cluttered). Even if it is not public, if the signature is expected to change many times, then explicit is better.
Assuming that this class is just a component of the T5 model class, won't change, and is not intended for direct use, the use of args and kwargs will make the code shorter, and I think that is reasonable.
I agree with @mthrok with regard to usage. I think in general it wouldn't hurt to keep the arguments explicit :) (Imagine some enthusiastic user who quickly wants to try out this new T5-style MHA in their work; it would be a bit easier, of course at their own risk since we haven't yet made the API public.)
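For illustration, the two styles under discussion, sketched against torch.nn.MultiheadAttention (the PR currently keeps the explicit form):

```python
import torch.nn as nn

# Option A: forward everything to the parent class
class T5MultiheadAttentionShort(nn.MultiheadAttention):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

# Option B (as in the PR): explicit signature, easier to read and document
class T5MultiheadAttentionExplicit(nn.MultiheadAttention):
    def __init__(self, embed_dim, num_heads, dropout=0.0, bias=False, batch_first=False):
        super().__init__(embed_dim, num_heads, dropout=dropout, bias=bias, batch_first=batch_first)
```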
class T5EncoderLayer(nn.TransformerEncoderLayer):
    def __init__(
        self,
        d_model: int = 768,
        nhead: int = 12,
        dim_feedforward: int = 3072,
        dropout: float = 0.1,
        activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
        layer_norm_eps: float = 1e-6,
        batch_first: bool = False,
        norm_first: bool = True,
        compute_relative_attention_bias: bool = False,
        relative_attention_num_buckets: int = 32,
        relative_attention_max_distance: int = 128,
        relative_attention_bias: Optional[Tensor] = None,
        device=None,
        dtype=None,
Same comment as above about using *args and **kwargs
def __init__(
    self,
    d_model: int,
    d_feedforward: int,
    dropout: float,
    activation: Union[str, Callable[[Tensor], Tensor]],
    layer_norm_eps: float,
    num_heads: int,
    num_layers: int,
    batch_first: bool,
    relative_attention_num_buckets: int,
    relative_attention_max_distance: int,
    padding_idx: int,
    max_seq_len: int,
    vocab_size: int,
) -> None:
    super().__init__()
nit: Let's add docstrings explaining the purpose of the class and all of the input arguments before we actually merge in the implementation
self.dropout = nn.Dropout(dropout)

def forward(self, tokens: torch.Tensor):
nit: remove new line
    dtype=None,
) -> None:

    super(T5MultiheadAttention, self).__init__(
super(<CLASS_NAME>, self) is syntax from the Python 2 era. It can be as simple as:
- super(T5MultiheadAttention, self).__init__(
+ super().__init__(
):
    super(T5Encoder, self).__init__(encoder_layer, num_layers, norm, enable_nested_tensor)

    first_layer = copy.deepcopy(encoder_layer)
I am not sure the use of deep copy is guaranteed to keep working with PyTorch modules.
Instead of expecting an instance, it would be more elegant to accept the parameters required to build the component and instantiate as many encoder layers as needed.
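A sketch of that alternative, where layer_factory is a hypothetical callable that constructs a fresh encoder layer from parameters:

```python
import torch.nn as nn

def make_layers(layer_factory, num_layers: int) -> nn.ModuleList:
    # build each layer independently instead of deep-copying a prototype instance
    return nn.ModuleList([layer_factory() for _ in range(num_layers)])

# usage sketch (arguments to T5EncoderLayer are illustrative):
# layers = make_layers(lambda: T5EncoderLayer(d_model=768, nhead=12), num_layers=12)
```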
# NOTE: taken from HF; used to compute relative attention bias
def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
Let's add type annotations for consistency?
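For example, the annotated signature could look like this (sketch only):

```python
from torch import Tensor

def _relative_position_bucket(
    relative_position: Tensor,
    bidirectional: bool = True,
    num_buckets: int = 32,
    max_distance: int = 128,
) -> Tensor:
    ...
```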
# NOTE: modified from HF; used to compute relative attention bias
def _compute_bias(
Let's add type annotations for consistency?
# NOTE: modified from torch.nn.functional._scaled_dot_product_attention to incorporate relative attention bias
def _t5_scaled_dot_product_attention(
@pmabbo13 for easier follow-up, could you please add comments directly in the source code where it was modified to incorporate the relative attention bias?
return attn_output, attn_output_weights, position_bias

# NOTE: Taken from HF
Let's also provide a link to the source code.
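For instance, the NOTE could carry the link directly; the exact file path below is an assumption based on the current layout of the HF repository:

```python
# NOTE: Taken from Hugging Face Transformers
# (e.g. https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py),
# original code licensed under Apache 2.0.
```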
return relative_buckets

# NOTE: modified from HF; used to compute relative attention bias
Let's also provide a link to the original source code in the comment.
from torch import Tensor

# NOTE: taken from HF; used to compute relative attention bias
Let's also provide a link to the source code as a comment.
Description
Add T5 architecture to torchtext
Process
The T5 architecture is very similar to that of a traditional transformer. The main differences are that, rather than using positional embeddings, it computes a relative attention bias that encodes the relative position of a token within a sequence, and it uses a simplified layer normalization (root-mean-square normalization) at the start of every attention and feed-forward block. The position bias is passed into each layer and used when computing the attention scores.
Incorporating the relative attention bias requires under-the-hood changes to the MultiheadAttention module. We can use HF's implementation for computing the relative attention bias and modify the source code of torch.nn.MultiheadAttention to incorporate it. We can also create our own layer normalization, similar to HF's.
Given the above components, we can then define our own T5Layer, T5Stack, and T5Model:
* T5Layer can be used as either an encoder layer or a decoder layer based on an input boolean parameter. The only difference is that the decoder layer also performs cross-attention with the encoder output.
* T5Stack can also be used as either an encoder or a decoder based on an input boolean parameter, which dictates which type of layer composes the stack.
* T5Model can be used as either an encoder-only or an encoder-decoder model based on an input boolean parameter. If it is an encoder-decoder model, a causal mask is generated for the decoder input tokens (a mask sketch follows below).
Testing
Not yet implemented
Stack
WIP PR where implementation details were discussed: #1812
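For the encoder-decoder case, the causal mask for the decoder input could be generated along the lines of the sketch below, which matches the convention of torch.nn.Transformer.generate_square_subsequent_mask (whether the PR uses exactly this helper is an assumption):

```python
import torch

def generate_causal_mask(seq_len: int) -> torch.Tensor:
    # float mask with -inf above the diagonal (future positions) and 0 elsewhere
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
```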
Description
Add T5 architecture to torchtext
Process
The T5 architecture is very similar to that of a traditional transformer. The main differences are that, rather than using positional embeddings, it computes a relative attention bias that encodes the relative position of a token within a sequence, and it uses a simplified layer normalization (root-mean-square normalization) at the start of every attention and feed-forward block. The position bias is passed into each layer and used when computing the attention scores.
Incorporating the relative attention bias requires under-the-hood changes to the MultiheadAttention module. We can use HF's implementation for computing the relative attention bias and modify the source code of torch.nn.MultiheadAttention to incorporate it. We can also create our own layer normalization, similar to HF's.
Given the above components, we can then define our own T5EncoderLayer, T5DecoderLayer, T5Encoder, T5Decoder, and T5 modules, all of which can inherit from torch.nn.TransformerEncoderLayer, torch.nn.TransformerDecoderLayer, etc.
Testing
not yet implemented