
Conversation

@pmabbo13 (Contributor)

Description

Add T5 architecture to torchtext

Process

The T5 architecture is very similar to that of a traditional transformer. The main difference is that, rather than using positional embeddings, T5 computes a relative attention bias that encodes the relative position of each token within a sequence. This position bias is passed into every layer and used when computing the attention scores. T5 also uses a simplified layer normalization (root-mean-square normalization) that is applied at the start of every attention and feed-forward block.

Incorporating relative attention bias requires under-the-hood changes to the MultiheadAttention module. We can use HF's implementation for computing relative attention bias and modify the source code of torch.nn.MultiheadAttention to incorporate it. We can also create our own layer normalization, similar to HF's.

Given the above components, we can then define our own T5EncoderLayer, T5DecoderLayer, T5Encoder, T5Decoder, and T5 modules, all of which can inherit from torch.nn.TransformerEncoderLayer, torch.nn.TransformerDecoderLayer, etc.
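For orientation, here is a minimal structural sketch of that hierarchy. Only the class names come from the list above; the docstrings summarize the intent and everything else is illustrative, not the final API.

```python
# Structural sketch only; signatures and internals are illustrative.
import torch.nn as nn


class T5EncoderLayer(nn.TransformerEncoderLayer):
    """Pre-norm layer using T5LayerNorm and relative-attention-bias-aware MHA."""


class T5Encoder(nn.TransformerEncoder):
    """Stack of T5EncoderLayers; only the first layer computes the position bias."""


class T5DecoderLayer(nn.TransformerDecoderLayer):
    """Like T5EncoderLayer, plus cross-attention over the encoder output."""


class T5Decoder(nn.TransformerDecoder):
    """Stack of T5DecoderLayers."""
```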

Testing

not yet implemented

relative_buckets += torch.where(is_small, relative_position, relative_position_if_large)
return relative_buckets


@pmabbo13 (Contributor, Author), Jul 6, 2022

_compute_bias() is the same as the HF implementation, except that we explicitly pass in the parameters relative_attention_bias, relative_attention_num_buckets, relative_attention_max_distance, and bidirectional.
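For readers following along, a sketch of what the function could look like with those parameters made explicit. The exact signature and argument order here are assumptions, not the code in this diff; the bucketing helper is sketched further below.

```python
from typing import Optional

import torch
import torch.nn as nn
from torch import Tensor


def _compute_bias_sketch(
    query_length: int,
    key_length: int,
    relative_attention_bias: nn.Embedding,  # embedding table of shape (num_buckets, num_heads)
    relative_attention_num_buckets: int = 32,
    relative_attention_max_distance: int = 128,
    bidirectional: bool = True,
    device: Optional[torch.device] = None,
) -> Tensor:
    """Compute binned relative position bias of shape (1, num_heads, query_length, key_length)."""
    context_position = torch.arange(query_length, dtype=torch.long, device=device)[:, None]
    memory_position = torch.arange(key_length, dtype=torch.long, device=device)[None, :]
    relative_position = memory_position - context_position  # (query_length, key_length)
    relative_position_bucket = _relative_position_bucket_sketch(  # bucketing helper, sketched below
        relative_position,
        bidirectional=bidirectional,
        num_buckets=relative_attention_num_buckets,
        max_distance=relative_attention_max_distance,
    )
    values = relative_attention_bias(relative_position_bucket)  # (query_length, key_length, num_heads)
    return values.permute([2, 0, 1]).unsqueeze(0)  # (1, num_heads, query_length, key_length)
```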


import torch


@pmabbo13 (Contributor, Author), Jul 6, 2022

_relative_position_bucket() is identical to the HF implementation.
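For context, the bucketing logic looks roughly like this, mirroring the HF function the comment refers to; treat it as a sketch rather than the exact code in this diff.

```python
import math

import torch
from torch import Tensor


def _relative_position_bucket_sketch(
    relative_position: Tensor,
    bidirectional: bool = True,
    num_buckets: int = 32,
    max_distance: int = 128,
) -> Tensor:
    relative_buckets = torch.zeros_like(relative_position)
    if bidirectional:
        # Half of the buckets are reserved for positions to one side.
        num_buckets //= 2
        relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
        relative_position = torch.abs(relative_position)
    else:
        relative_position = -torch.min(relative_position, torch.zeros_like(relative_position))
    # Small distances get one bucket each; larger distances share buckets
    # that grow logarithmically up to max_distance.
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    relative_position_if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).to(torch.long)
    relative_position_if_large = torch.min(
        relative_position_if_large, torch.full_like(relative_position_if_large, num_buckets - 1)
    )
    relative_buckets += torch.where(is_small, relative_position, relative_position_if_large)
    return relative_buckets
```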

Contributor

@parmeet @mthrok do you know the guidance for citing code that is taken from the HF library? It looks like they have an Apache 2.0 license.

Contributor

The general guideline is to retain license headers and the license description, but one should consult https://fburl.com/y41qjtzl to be sure.

Contributor

I saw this as one way of citing in PyTorch core. Maybe we could do something similar.

Contributor

I consulted internally, since I have to do something similar for the BERT Tokenizer implementation. In summary, we need to include a modified header in the source code. You can refer to the header in the BERT tokenizer CPP file.

values = values.permute([2, 0, 1]).unsqueeze(0) # shape (1, num_heads, query_length, key_length)
return values


@pmabbo13 (Contributor, Author), Jul 6, 2022

_t5_scaled_dot_product_attention() is modified from the PyTorch implementation to incorporate relative attention bias when computing the attention scores. position_bias is passed in as a parameter (line 90) and is added to the attention score computation in lines 119-124, 129, and 133.
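To make the described change concrete, here is a rough sketch of where the bias enters the score computation. Shapes follow the stock torch.nn.functional kernel this function is modified from; the actual code in the diff may differ.

```python
from typing import Optional, Tuple

import torch
import torch.nn.functional as F
from torch import Tensor


def _t5_sdpa_sketch(
    q: Tensor,  # (batch * num_heads, tgt_len, head_dim)
    k: Tensor,  # (batch * num_heads, src_len, head_dim)
    v: Tensor,  # (batch * num_heads, src_len, head_dim)
    position_bias: Tensor,  # (1, num_heads, tgt_len, src_len)
    attn_mask: Optional[Tensor] = None,
    dropout_p: float = 0.0,
) -> Tuple[Tensor, Tensor]:
    bsz_x_heads, tgt_len, _ = q.shape
    num_heads = position_bias.size(1)
    # Raw scores. (T5 itself omits the 1/sqrt(head_dim) scaling of the
    # standard transformer, so no rescaling is shown here.)
    attn = torch.bmm(q, k.transpose(-2, -1))
    if attn_mask is not None:
        attn += attn_mask
    # Broadcast the per-head relative attention bias onto the scores.
    attn = attn.view(bsz_x_heads // num_heads, num_heads, tgt_len, -1) + position_bias
    attn = attn.view(bsz_x_heads, tgt_len, -1)
    attn = F.softmax(attn, dim=-1)
    if dropout_p > 0.0:
        attn = F.dropout(attn, p=dropout_p)
    output = torch.bmm(attn, v)
    return output, attn
```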

else:
return attn_output, attn_output_weights, position_bias


@pmabbo13 (Contributor, Author)

The T5 model uses root-mean-square layer normalization. T5LayerNorm implements this and is taken directly from HF.
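For reference, a minimal sketch of root-mean-square normalization along the lines of the HF T5LayerNorm the comment refers to (details such as dtype handling are simplified):

```python
import torch
import torch.nn as nn
from torch import Tensor


class T5LayerNormSketch(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.variance_epsilon = eps

    def forward(self, hidden_states: Tensor) -> Tensor:
        # Root-mean-square norm: scale by 1 / sqrt(mean(x^2) + eps);
        # unlike nn.LayerNorm there is no mean subtraction and no bias.
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states
```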

output = torch.bmm(attn, v)
return output, attn


@pmabbo13 (Contributor, Author)

t5_multi_head_attention_forward() is modified from its PyTorch implementation to incorporate relative attention bias. We've added the parameters compute_relative_attention_bias, relative_attention_bias, relative_attention_num_buckets, relative_attention_max_distance, and position_bias, which are used in lines 432-444 and 449. This implementation was inspired by a similar HF implementation.
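A minimal sketch of the extra logic described above, factored into a helper purely for readability (the helper, its name, and its signature are illustrative, not the PR's actual code): compute the relative attention bias once, otherwise reuse the tensor that was passed in.

```python
from typing import Optional

import torch
from torch import Tensor


def _get_position_bias_sketch(
    position_bias: Optional[Tensor],
    compute_relative_attention_bias: bool,
    relative_attention_bias: Optional[torch.nn.Embedding],
    relative_attention_num_buckets: int,
    relative_attention_max_distance: int,
    num_heads: int,
    tgt_len: int,
    src_len: int,
    device: Optional[torch.device] = None,
) -> Tensor:
    if position_bias is not None:
        return position_bias  # already computed by an earlier layer
    if compute_relative_attention_bias:
        # Uses the bias helper sketched earlier in this thread.
        return _compute_bias_sketch(
            tgt_len,
            src_len,
            relative_attention_bias,
            relative_attention_num_buckets,
            relative_attention_max_distance,
            bidirectional=True,
            device=device,
        )
    # One possible fallback when no layer supplies a bias: an all-zero bias
    # keeps the attention computation uniform across layers.
    return torch.zeros((1, num_heads, tgt_len, src_len), device=device)
```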

Contributor

Is it absolutely necessary that this implementation is a plain function instead of a method on the T5 class?

I see that many attributes from the model instance are being passed in, which makes me wonder what the value-add of the T5 model class implementation is.

PyTorch's implementation (which originated from torchtext) tries to cover many use cases, so its signature is complex, whereas this implementation only needs to serve the T5 model; can't it be simplified?


return attn_output, None, position_bias


@pmabbo13 (Contributor, Author)

T5MultiheadAttention inherits from PyTorch's MultiheadAttention. The forward method is a trimmed-down version of theirs that includes the parameters necessary to incorporate relative attention bias. PyTorch's implementation includes some torchscript optimizations (lines 1069-1126) that I wasn't too sure about, so I've omitted them for now.


return self.weight * hidden_states


@pmabbo13 (Contributor, Author)

T5EncoderLayer inherits from PyTorch's TransformerEncoderLayer. It is initialized to incorporate the relative attention bias parameters, use T5MultiheadAttention and T5LayerNorm, and remove the bias from the linear layers (lines 627-636). The forward method is also taken from the PyTorch implementation and modified to incorporate the position bias (lines 656-694). The PyTorch method included some torchscript optimizations (lines 410-457) that I've omitted for now.
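A minimal sketch of the pre-norm ("norm_first") forward pass described above. The helper names (_sa_block, _ff_block, norm1, norm2) mirror nn.TransformerEncoderLayer, and the position-bias plumbing is illustrative rather than the exact code in the diff.

```python
from typing import Optional, Tuple

from torch import Tensor


def t5_encoder_layer_forward_sketch(
    layer,  # a T5EncoderLayer-like module (illustrative)
    src: Tensor,
    src_mask: Optional[Tensor] = None,
    src_key_padding_mask: Optional[Tensor] = None,
    position_bias: Optional[Tensor] = None,
) -> Tuple[Tensor, Optional[Tensor]]:
    x = src
    # Self-attention block: RMS-normalize first, attend with relative
    # attention bias, then add the residual.
    sa_out, position_bias = layer._sa_block(
        layer.norm1(x), src_mask, src_key_padding_mask, position_bias
    )
    x = x + sa_out
    # Feed-forward block, also pre-normalized.
    x = x + layer._ff_block(layer.norm2(x))
    # Return the bias so the stack can pass it up to higher layers.
    return x, position_bias
```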

x = self.linear2(self.dropout(self.activation(self.linear1(x))))
return self.dropout2(x)


@pmabbo13 (Contributor, Author)

T5Encoder inherits from PyTorch's TransformerEncoder. The initialization is mostly the same, except that we only compute the relative attention bias in the first layer; the resulting tensor is passed up to the higher layers to avoid recomputing it. Lines 732-735 incorporate this distinction. The forward method is a trimmed-down version of the PyTorch implementation, which included torchscript optimizations (lines 206-232) that I've omitted for now. They've also implemented something called convert_to_nested whose purpose I'm not too sure of.
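A minimal sketch of that stack-level forward: only the first layer computes the relative attention bias, and the returned tensor is fed to the remaining layers. Names and keyword arguments are illustrative.

```python
from typing import Optional

from torch import Tensor


def t5_encoder_forward_sketch(
    encoder,  # a T5Encoder-like module (illustrative)
    src: Tensor,
    mask: Optional[Tensor] = None,
    src_key_padding_mask: Optional[Tensor] = None,
) -> Tensor:
    output = src
    position_bias = None  # produced by the first layer, reused by the rest
    for mod in encoder.layers:
        output, position_bias = mod(
            output,
            src_mask=mask,
            src_key_padding_mask=src_key_padding_mask,
            position_bias=position_bias,
        )
    return output
```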


return output


@pmabbo13 (Contributor, Author)

T5EncoderModel is the complete encoder model; it includes the word embeddings as well as the final layer normalization and dropout.
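A minimal sketch of what that end-to-end forward pass could look like. The attribute names (token_embeddings, padding_idx, encoder, norm, dropout) are illustrative placeholders, not the PR's actual field names.

```python
import torch
from torch import Tensor


def t5_encoder_model_forward_sketch(model, tokens: Tensor) -> Tensor:
    padding_mask = tokens.eq(model.padding_idx)  # ignore pad tokens in attention
    embedded = model.dropout(model.token_embeddings(tokens))
    encoded = model.encoder(embedded, src_key_padding_mask=padding_mask)
    encoded = model.norm(encoded)  # final T5LayerNorm
    return model.dropout(encoded)
```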

@Nayef211 (Contributor) left a comment

Overall, the implementation looks fine to me in terms of class separation and design. To verify correctness, let's compare the outputs of the HF implementation with ours, as we discussed. We can take a closer look at the components once we break this PR up into several smaller ones, after we've verified the correctness of the model!

@abhinavarora it would be useful to get your feedback on whether all of the changes from the T5 paper are represented correctly in the code.



Comment on lines +476 to +488
def __init__(
self,
embed_dim,
num_heads,
dropout=0.0,
bias=False,
add_bias_kv=False,
add_zero_attn=False,
kdim=None,
vdim=None,
batch_first=False,
device=None,
dtype=None,
Contributor

When inheriting from a parent class, I would make use of *args and **kwargs to pass multiple arguments and keyword arguments to the parent class. In the T5MultiheadAttention class, you don't make use of any of the input args during initialization, so using *args and **kwargs and passing them to the parent class for initialization would be less verbose.

@mthrok, I wanted to double-check whether you follow this practice in torchaudio?
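A minimal sketch of the suggestion above, purely illustrative: forward everything to the parent and keep only the T5-specific arguments explicit (the compute_relative_attention_bias argument here is a stand-in example).

```python
import torch.nn as nn


class T5MultiheadAttentionSketch(nn.MultiheadAttention):
    def __init__(self, *args, compute_relative_attention_bias: bool = False, **kwargs) -> None:
        # Everything MultiheadAttention already understands goes straight through.
        super().__init__(*args, **kwargs)
        self.compute_relative_attention_bias = compute_relative_attention_bias
```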

Contributor

I think it depends on the context; there are pros and cons to both approaches. If the signature of the constructor is a public API, then explicitly listing the arguments can be helpful for documentation (at the same time, it can appear cluttered). Even if it is not public, if the signature is expected to change many times, then explicit is better.

Assuming that this class is just a component of the T5 model class, won't change, and is not intended for direct use, using args and kwargs will make the code shorter, and I think that is reasonable.

Contributor

I agree with @mthrok with regard to usage. I think in general it wouldn't hurt to keep the arguments explicit :) (Imagine some enthusiastic user who quickly wants to try out this new T5-style MHA in their work; it would be a bit easier for them, of course at their own risk since we haven't yet made the API public.)

Comment on lines +605 to +621
class T5EncoderLayer(nn.TransformerEncoderLayer):
def __init__(
self,
d_model: int = 768,
nhead: int = 12,
dim_feedforward: int = 3072,
dropout: float = 0.1,
activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
layer_norm_eps: float = 1e-6,
batch_first: bool = False,
norm_first: bool = True,
compute_relative_attention_bias: bool = False,
relative_attention_num_buckets: int = 32,
relative_attention_max_distance: int = 128,
relative_attention_bias: Optional[Tensor] = None,
device=None,
dtype=None,
Contributor

Same comment as above about using *args and **kwargs

Comment on lines +762 to +778
def __init__(
self,
d_model: int,
d_feedforward: int,
dropout: float,
activation: Union[str, Callable[[Tensor], Tensor]],
layer_norm_eps: float,
num_heads: int,
num_layers: int,
batch_first: bool,
relative_attention_num_buckets: int,
relative_attention_max_distance: int,
padding_idx: int,
max_seq_len: int,
vocab_size: int,
) -> None:
super().__init__()
Contributor

nit: Let's add docstrings explaining the purpose of the class and all of the input arguments before we actually merge in the implementation

self.dropout = nn.Dropout(dropout)

def forward(self, tokens: torch.Tensor):

Contributor

nit: remove the extra blank line

dtype=None,
) -> None:

super(T5MultiheadAttention, self).__init__(
Contributor

super(<CLASS_NAME>, self) is Python 2-era syntax. It can be as simple as:

Suggested change:
- super(T5MultiheadAttention, self).__init__(
+ super().__init__(

):
super(T5Encoder, self).__init__(encoder_layer, num_layers, norm, enable_nested_tensor)

first_layer = copy.deepcopy(encoder_layer)
Contributor

I am not sure that the use of deepcopy is guaranteed to keep working with PyTorch modules.

Instead of expecting an instance, it is more elegant to accept the parameters required to build the component and instantiate as many encoder layers as needed.
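A minimal sketch of that suggestion: build each layer from parameters instead of deep-copying a prototype instance. It assumes the T5EncoderLayer signature shown elsewhere in this diff; the helper itself is illustrative.

```python
import torch.nn as nn


def make_t5_encoder_layers_sketch(
    num_layers: int, d_model: int, nhead: int, **layer_kwargs
) -> nn.ModuleList:
    # Each layer is constructed fresh from parameters; only the first one is
    # asked to compute the relative attention bias, the rest reuse it.
    return nn.ModuleList(
        [
            T5EncoderLayer(
                d_model,
                nhead,
                compute_relative_attention_bias=(i == 0),
                **layer_kwargs,
            )
            for i in range(num_layers)
        ]
    )
```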



# NOTE: taken from HF; used to compute relative attention bias
def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
Contributor

Let's add type annotations for consistency?



# NOTE: modified from HF; used to compute relative attention bias
def _compute_bias(
Contributor

Let's add type annotations for consistency?





# NOTE: modified from torch.nn.functional._scaled_dot_product_attention to incorporate relative attention bias
def _t5_scaled_dot_product_attention(
Contributor

@pmabbo13 for easier follow-up, could you please add comments directly in the source code wherever it was modified to incorporate relative attention bias?


return attn_output, attn_output_weights, position_bias


# NOTE: Taken from HF
Contributor

Let's also provide a link to the source code.

return relative_buckets


# NOTE: modified from HF; used to compute relative attention bias
Contributor

Let's also provide a link to the original source code in the comment.

from torch import Tensor


# NOTE: taken from HF; used to compute relative attention bias
Contributor

Let's also provide a link to the source code as a comment.

pmabbo13 added a commit that referenced this pull request Jul 14, 2022
# Description

Add T5 architecture to torchtext

# Process

The T5 architecture is very similar to that of a traditional transformer. The main difference is that, rather than using positional embeddings, T5 computes a relative attention bias that encodes the relative position of each token within a sequence. This position bias is passed into every layer and used when computing the attention scores. T5 also uses a simplified layer normalization (root-mean-square normalization) that is applied at the start of every attention and feed-forward block.

Incorporating relative attention bias requires under-the-hood changes to the MultiheadAttention module. We can use HF's implementation for computing relative attention bias and modify the source code of torch.nn.MultiheadAttention to incorporate it. We can also create our own layer normalization, similar to HF's.

Given the above components, we can then define our own T5Layer, T5Stack, and T5Model.
* The T5Layer can be used either as an encoder layer or a decoder layer based on an input boolean parameter. The only difference between the decoder layer and the encoder layer is that the decoder layer also performs cross-attention with the encoder output.
* T5Stack can also be used as either an encoder or a decoder based on an input boolean parameter, which dictates which type of layer composes the stack.
* T5Model can be used either as an encoder-only or an encoder-decoder model based on an input boolean parameter. If it is an encoder-decoder model, a causal mask is generated for the decoder input tokens (a minimal sketch of such a mask follows below).
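A minimal sketch of generating a causal (subsequent-position) mask for the decoder input tokens; the additive float-mask convention follows torch.nn.Transformer and is an assumption here, not necessarily how this PR builds it.

```python
from typing import Optional

import torch


def causal_mask_sketch(tgt_len: int, device: Optional[torch.device] = None) -> torch.Tensor:
    # Upper-triangular -inf mask: position i may only attend to positions <= i.
    mask = torch.full((tgt_len, tgt_len), float("-inf"), device=device)
    return torch.triu(mask, diagonal=1)
```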

# Testing
not yet implemented

# Stack

WIP PR where implementation details were discussed: #1812 

[ghstack-poisoned]
pmabbo13 added a commit that referenced this pull request Jul 14, 2022
…attention bias"


WIP PR to workshop implementation: #1812 

[ghstack-poisoned]
pmabbo13 added a commit that referenced this pull request Jul 14, 2022
pmabbo13 added a commit that referenced this pull request Jul 14, 2022
WIP PR to workshop implementation: #1812 

[ghstack-poisoned]
pmabbo13 added a commit that referenced this pull request Jul 15, 2022
…attention bias"


WIP PR to workshop implementation: #1812 

[ghstack-poisoned]
pmabbo13 added a commit that referenced this pull request Jul 15, 2022
WIP PR to workshop implementation: #1812 

[ghstack-poisoned]
pmabbo13 added a commit that referenced this pull request Jul 18, 2022
pmabbo13 added a commit that referenced this pull request Jul 18, 2022
…attention bias"


WIP PR to workshop implementation: #1812 

[ghstack-poisoned]
pmabbo13 added a commit that referenced this pull request Jul 18, 2022
WIP PR to workshop implementation: #1812 

[ghstack-poisoned]
pmabbo13 added a commit that referenced this pull request Jul 18, 2022