[WIP] Add T5 Model to TorchText #1812
Conversation
relative_buckets += torch.where(is_small, relative_position, relative_position_if_large)
return relative_buckets
_compute_bias() is the same as the HF implementation, except that we explicitly pass in the parameters relative_attention_bias, relative_attention_num_buckets, relative_attention_max_distance, and bidirectional.
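For reference, a minimal sketch of what a standalone _compute_bias() with explicit parameters might look like, following the HF logic; it relies on the _relative_position_bucket() helper discussed below, and the exact signature and argument order in the PR may differ:

```python
import torch
import torch.nn as nn
from torch import Tensor

def _compute_bias(
    query_length: int,
    key_length: int,
    relative_attention_bias: nn.Embedding,
    relative_attention_num_buckets: int = 32,
    relative_attention_max_distance: int = 128,
    bidirectional: bool = True,
    device=None,
) -> Tensor:
    """Compute binned relative position bias, shape (1, num_heads, query_length, key_length)."""
    context_position = torch.arange(query_length, dtype=torch.long, device=device)[:, None]
    memory_position = torch.arange(key_length, dtype=torch.long, device=device)[None, :]
    relative_position = memory_position - context_position  # (query_length, key_length)
    relative_position_bucket = _relative_position_bucket(
        relative_position,
        bidirectional=bidirectional,
        num_buckets=relative_attention_num_buckets,
        max_distance=relative_attention_max_distance,
    )
    values = relative_attention_bias(relative_position_bucket)  # (query_length, key_length, num_heads)
    values = values.permute([2, 0, 1]).unsqueeze(0)  # (1, num_heads, query_length, key_length)
    return values
```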
import torch
_relative_position_bucket() is identical to the HF implementation.
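For context, a sketch of that bucketing logic as it appears in HF's T5 code (reproduced from memory, so treat the details as approximate):

```python
import math
import torch
from torch import Tensor

def _relative_position_bucket(
    relative_position: Tensor, bidirectional: bool = True, num_buckets: int = 32, max_distance: int = 128
) -> Tensor:
    relative_buckets = torch.zeros_like(relative_position)
    if bidirectional:
        num_buckets //= 2
        relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
        relative_position = torch.abs(relative_position)
    else:
        relative_position = -torch.min(relative_position, torch.zeros_like(relative_position))
    # half of the buckets cover exact increments; the rest grow logarithmically up to max_distance
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    relative_position_if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).to(torch.long)
    relative_position_if_large = torch.min(
        relative_position_if_large, torch.full_like(relative_position_if_large, num_buckets - 1)
    )
    relative_buckets += torch.where(is_small, relative_position, relative_position_if_large)
    return relative_buckets
```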
@parmeet @mthrok do you know the guidance for citing code that is taken from the HF library? It looks like they have an Apache 2.0 license.
The general guideline is to retain license headers and license description.
But one should consult https://fburl.com/y41qjtzl to be sure.
I saw this as one way of citing sources in PyTorch core. Maybe we could do something similar.
I consulted internally, since I have to do something similar for the BERT tokenizer implementation. In summary, we need to include a modified license header in the source code. You can refer to the header in the BERT tokenizer CPP file.
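Purely as an illustration, a modified header might read something like the following; the exact wording and copyright line should follow the internal guidance and the BERT tokenizer header rather than this sketch:

```python
# This file contains code adapted from Hugging Face Transformers
# (https://github.com/huggingface/transformers), which is licensed under
# the Apache License, Version 2.0. Modifications are noted inline.
```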
values = values.permute([2, 0, 1]).unsqueeze(0)  # shape (1, num_heads, query_length, key_length)
return values
_t5_scaled_dot_product_attention() is modified from the PyTorch implementation to incorporate the relative attention bias when computing the attention scores. position_bias is passed in as a parameter (line 90) and is added to the attention score computation in lines 119-124, 129, and 133.
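In essence, the change amounts to adding position_bias to the raw attention scores before the softmax. A simplified, self-contained sketch rather than the exact PR code (note that T5 conventionally omits the 1/sqrt(d) scaling):

```python
import torch
import torch.nn.functional as F
from torch import Tensor

def t5_attention_sketch(q: Tensor, k: Tensor, v: Tensor, position_bias: Tensor, dropout_p: float = 0.0):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    # position_bias: (1, num_heads, seq_len, seq_len), broadcast across the batch
    scores = torch.matmul(q, k.transpose(-2, -1))  # no 1/sqrt(d) scaling in T5
    scores = scores + position_bias                # the key change vs. the stock implementation
    attn = F.softmax(scores, dim=-1)
    attn = F.dropout(attn, p=dropout_p)
    return torch.matmul(attn, v), attn
```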
else:
    return attn_output, attn_output_weights, position_bias
The T5 model uses a root-mean-square layer normalization. T5LayerNorm implements this and is taken directly from HF.
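A minimal sketch of such an RMS-style layer norm, following HF's T5LayerNorm (no mean subtraction and no bias term; dtype handling is omitted):

```python
import torch
import torch.nn as nn
from torch import Tensor

class T5LayerNormSketch(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.variance_epsilon = eps

    def forward(self, hidden_states: Tensor) -> Tensor:
        # root-mean-square normalization: scale by 1/sqrt(mean(x^2)), no centering
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states
```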
output = torch.bmm(attn, v)
return output, attn
t5_multi_head_attention_forward() is modified from its PyTorch implementation to incorporate relative attention bias. We've added the parameters compute_relative_attention_bias, relative_attention_bias, relative_attention_num_buckets, relative_attention_max_distance, and position_bias, which are used in lines 432-444 and 449. This implementation was inspired by a similar HF implementation.
Is it absolutely necessary for this implementation to be a plain function, instead of a method on the T5 class?
I see that many attributes from the model instance are being passed in, which makes me wonder what the value-add of the T5 model class implementation is.
PyTorch's implementation (which originated from torchtext) tries to cover many use cases, so its signature is complex, whereas this implementation only needs to serve the T5 model; can't it be simplified?
return attn_output, None, position_bias
T5MultiheadAttention inherits from PyTorch's MultiheadAttention. The forward method is a trimmed-down version of theirs that includes the parameters necessary to incorporate relative attention bias. PyTorch's implementation includes some torchscript optimizations (lines 1069-1126) that I wasn't too sure about, so I omitted them for now.
return self.weight * hidden_states
T5EncoderLayer inherits from PyTorch's TransformerEncoderLayer. It is initialized to incorporate the relative attention bias parameters, use T5MultiheadAttention and T5LayerNorm, and remove the bias from the linear layers (lines 627-636). The forward method is also taken from the PyTorch implementation and modified to incorporate the position bias (lines 656-694). The PyTorch method included some torchscript optimizations (lines 410-457) that I omitted for now.
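Schematically, the modified forward becomes a pre-norm residual block that threads position_bias through self-attention and returns it with the output. A hedged, illustrative sketch (the block callables stand in for the layer's actual sub-modules; the real signature in the PR may differ):

```python
def t5_encoder_layer_forward(x, self_attn_block, ff_block, norm1, norm2, position_bias=None):
    # pre-norm self-attention: the attention block consumes and returns the position bias
    sa_out, position_bias = self_attn_block(norm1(x), position_bias)
    x = x + sa_out
    # pre-norm feed-forward block
    x = x + ff_block(norm2(x))
    return x, position_bias
```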
x = self.linear2(self.dropout(self.activation(self.linear1(x))))
return self.dropout2(x)
T5Encoder inherits from PyTorch's TransformerEncoder. The initialization is mostly the same, except that we only compute the relative attention bias in the first layer; the resulting tensor is then passed up to the higher layers to avoid re-computing it. Lines 732-735 incorporate this distinction. The forward method is a trimmed-down version of the PyTorch implementation's, which included torchscript optimizations (lines 206-232) that I've omitted for now. They've also implemented something called convert_to_nested whose purpose I'm not too sure of.
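The bias reuse boils down to a loop along these lines (a sketch only; the layer call signature is an assumption):

```python
def run_t5_encoder_stack(layers, x, mask=None):
    # only the first layer computes the relative attention bias; each later layer
    # reuses the tensor returned by the previous call instead of recomputing it
    position_bias = None
    for layer in layers:
        x, position_bias = layer(x, mask=mask, position_bias=position_bias)
    return x
```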
return output
T5EncoderModel is the complete encoder model, which includes the word embeddings and the final layer normalization and dropout.
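End to end, the forward pass might look roughly like this sketch (the component names token_embeddings, encoder, final_norm, dropout and the src_key_padding_mask keyword are illustrative assumptions, not the PR's exact API):

```python
def t5_encoder_model_forward(tokens, token_embeddings, encoder, final_norm, dropout, padding_idx=0):
    # tokens: (batch, seq_len) integer token ids
    padding_mask = tokens.eq(padding_idx)              # mask out pad positions in attention
    x = dropout(token_embeddings(tokens))              # word embeddings + dropout
    x = encoder(x, src_key_padding_mask=padding_mask)  # stack of T5 encoder layers
    return dropout(final_norm(x))                      # final RMS layer norm + dropout
```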
Overall the implementation looks fine to me in terms of class separation and design. To verify correctness, let's compare the outputs of the HF implementation with ours as we discussed. We can take a closer look at the components once we break up this PR into several smaller ones after we've verified correctness of the model!
@abhinavarora it would be useful to get your feedback on whether all of the changes from the T5 paper are represented correctly in the code.
def __init__(
    self,
    embed_dim,
    num_heads,
    dropout=0.0,
    bias=False,
    add_bias_kv=False,
    add_zero_attn=False,
    kdim=None,
    vdim=None,
    batch_first=False,
    device=None,
    dtype=None,
When inheriting from a parent class, I would make use of *args and **kwargs to pass arguments and keyword arguments through to the parent class. In the T5MultiheadAttention class, you don't make use of any of the input args during initialization, so using *args and **kwargs and passing them to the parent class for initialization would be less verbose.
@mthrok I wanted to double-check whether you follow this practice in torchaudio?
I think it depends on the context. There are pros and cons to both approaches. If the signature of the constructor is a public API, then explicitly listing the arguments can be helpful for documentation (at the same time, it can appear cluttered). Even if it is not public, if the signature is expected to change many times, then explicit is better.
Assuming that this class is just a component of the T5 model class, won't change, and is not intended for direct use, the use of args and kwargs will make the code shorter, and I think that is reasonable.
I agree with @mthrok with regard to usage. I think in general it wouldn't hurt to keep the arguments explicit :) (Imagine some enthusiastic user who quickly wants to try out this new T5-style MHA in their work; it would be a bit easier, of course at their own risk since we haven't yet made the API public.)
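For illustration, the two styles under discussion, sketched against torch.nn.MultiheadAttention (the PR currently keeps the explicit form):

```python
import torch.nn as nn

# Option A: forward everything to the parent class
class T5MultiheadAttentionShort(nn.MultiheadAttention):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

# Option B (as in the PR): explicit signature, easier to read and document
class T5MultiheadAttentionExplicit(nn.MultiheadAttention):
    def __init__(self, embed_dim, num_heads, dropout=0.0, bias=False, batch_first=False):
        super().__init__(embed_dim, num_heads, dropout=dropout, bias=bias, batch_first=batch_first)
```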
class T5EncoderLayer(nn.TransformerEncoderLayer):
    def __init__(
        self,
        d_model: int = 768,
        nhead: int = 12,
        dim_feedforward: int = 3072,
        dropout: float = 0.1,
        activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
        layer_norm_eps: float = 1e-6,
        batch_first: bool = False,
        norm_first: bool = True,
        compute_relative_attention_bias: bool = False,
        relative_attention_num_buckets: int = 32,
        relative_attention_max_distance: int = 128,
        relative_attention_bias: Optional[Tensor] = None,
        device=None,
        dtype=None,
Same comment as above about using *args and **kwargs
def __init__(
    self,
    d_model: int,
    d_feedforward: int,
    dropout: float,
    activation: Union[str, Callable[[Tensor], Tensor]],
    layer_norm_eps: float,
    num_heads: int,
    num_layers: int,
    batch_first: bool,
    relative_attention_num_buckets: int,
    relative_attention_max_distance: int,
    padding_idx: int,
    max_seq_len: int,
    vocab_size: int,
) -> None:
    super().__init__()
nit: Let's add docstrings explaining the purpose of the class and all of the input arguments before we actually merge in the implementation
self.dropout = nn.Dropout(dropout)

def forward(self, tokens: torch.Tensor):
nit: remove new line
    dtype=None,
) -> None:

    super(T5MultiheadAttention, self).__init__(
super(<CLASS_NAME>, self) is syntax from the Python 2 era. It can be as simple as:
- super(T5MultiheadAttention, self).__init__(
+ super().__init__(
):
    super(T5Encoder, self).__init__(encoder_layer, num_layers, norm, enable_nested_tensor)

    first_layer = copy.deepcopy(encoder_layer)
I am not sure the use of deep copy is guaranteed to keep working with PyTorch modules.
Instead of expecting an instance, it would be more elegant to accept the parameters required to build the component and instantiate as many encoder layers as needed.
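A sketch of that alternative, where layer_factory is a hypothetical callable that constructs a fresh encoder layer from parameters:

```python
import torch.nn as nn

def make_layers(layer_factory, num_layers: int) -> nn.ModuleList:
    # build each layer independently instead of deep-copying a prototype instance
    return nn.ModuleList([layer_factory() for _ in range(num_layers)])

# usage sketch (arguments to T5EncoderLayer are illustrative):
# layers = make_layers(lambda: T5EncoderLayer(d_model=768, nhead=12), num_layers=12)
```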
# NOTE: taken from HF; used to compute relative attention bias
def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
Let's add type annotations for consistency?
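For example, the annotated signature could look like this (sketch only):

```python
from torch import Tensor

def _relative_position_bucket(
    relative_position: Tensor,
    bidirectional: bool = True,
    num_buckets: int = 32,
    max_distance: int = 128,
) -> Tensor:
    ...
```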
# NOTE: modified from HF; used to compute relative attention bias
def _compute_bias(
Let's add type annotations for consistency?
# NOTE: modified from torch.nn.functional._scaled_dot_product_attention to incorporate relative attention bias
def _t5_scaled_dot_product_attention(
@pmabbo13 for easier follow-up, could you please add comments directly in the source code where it was modified to incorporate the relative attention bias?
return attn_output, attn_output_weights, position_bias

# NOTE: Taken from HF
Let's also provide a link to the source code.
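For instance, the NOTE could carry the link directly; the exact file path below is an assumption based on the current layout of the HF repository:

```python
# NOTE: Taken from Hugging Face Transformers
# (e.g. https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py),
# original code licensed under Apache 2.0.
```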
return relative_buckets

# NOTE: modified from HF; used to compute relative attention bias
Let's also provide a link to the original source code in the comment.
from torch import Tensor

# NOTE: taken from HF; used to compute relative attention bias
Let's also provide a link to the source code as a comment.
Description
Add T5 architecture to torchtext
Process
The T5 architecture is very similar to that of a traditional transformer. The main differences are that, rather than using positional embeddings, it computes a relative attention bias that encodes the relative position of a token within a sequence, and it uses a simplified layer normalization (root-mean-square normalization) at the start of every attention and feed-forward block. The position bias is passed into each layer and used when computing the attention scores.
Incorporating the relative attention bias requires under-the-hood changes to the MultiheadAttention module. We can use HF's implementation for computing the relative attention bias and modify the source code of torch.nn.MultiheadAttention to incorporate it. We can also create our own layer normalization, similar to HF's.
Given the above components, we can then define our own T5Layer, T5Stack, and T5Model:
* T5Layer can be used as either an encoder layer or a decoder layer based on an input boolean parameter. The only difference is that the decoder layer also performs cross-attention with the encoder output.
* T5Stack can also be used as either an encoder or a decoder based on an input boolean parameter, which dictates which type of layer composes the stack.
* T5Model can be used as either an encoder-only or an encoder-decoder model based on an input boolean parameter. If it is an encoder-decoder model, a causal mask is generated for the decoder input tokens (a mask sketch follows below).
Testing
Not yet implemented
Stack
WIP PR where implementation details were discussed: #1812
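For the encoder-decoder case, the causal mask for the decoder input could be generated along the lines of the sketch below, which matches the convention of torch.nn.Transformer.generate_square_subsequent_mask (whether the PR uses exactly this helper is an assumption):

```python
import torch

def generate_causal_mask(seq_len: int) -> torch.Tensor:
    # float mask with -inf above the diagonal (future positions) and 0 elsewhere
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
```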
Description
Add T5 architecture to torchtext
Process
The T5 architecture is very similar to that of a traditional transformer. The main differences are that, rather than using positional embeddings, it computes a relative attention bias that encodes the relative position of a token within a sequence, and it uses a simplified layer normalization (root-mean-square normalization) at the start of every attention and feed-forward block. The position bias is passed into each layer and used when computing the attention scores.
Incorporating the relative attention bias requires under-the-hood changes to the MultiheadAttention module. We can use HF's implementation for computing the relative attention bias and modify the source code of torch.nn.MultiheadAttention to incorporate it. We can also create our own layer normalization, similar to HF's.
Given the above components, we can then define our own T5EncoderLayer, T5DecoderLayer, T5Encoder, T5Decoder, and T5 modules, all of which can inherit from torch.nn.TransformerEncoderLayer, torch.nn.TransformerDecoderLayer, etc.
Testing
not yet implemented