MultiheadAttention building blocks in torchtext #720
Conversation
…eadInProject, MultiheadOutProject (force-pushed from 1765b69 to 2b9b68c)
…tch dim of either query or key/value to be 1 (force-pushed from 2b9b68c to 66b71ac)
fmassa left a comment
Did a quick pass on the benchmark scripts, and I think we can still improve them (especially for CUDA).
This could explain why the MHA implementation here seems to be significantly faster than the PyTorch one (which has a number of sync points internally).
```
attn_mask=torch.stack([attn_mask_2D] * (bsz * nhead)),
bias_k=bias_k.repeat(1, bsz, 1).reshape(1, bsz * nhead, -1),
bias_v=bias_v.repeat(1, bsz, 1).reshape(1, bsz * nhead, -1))
print(time.monotonic() - t0)
```
If you are benchmarking with CUDA, you need to add a torch.cuda.synchronize() before and after measuring the time, otherwise the timings won't be correct
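For reference, a minimal timing pattern with the extra synchronization (the helper below is only illustrative, not from the benchmark script):

```python
import time
import torch

def timed(fn, *args):
    # CUDA kernels launch asynchronously, so wait for pending work
    # before starting and before stopping the clock.
    torch.cuda.synchronize()
    t0 = time.monotonic()
    out = fn(*args)
    torch.cuda.synchronize()
    return out, time.monotonic() - t0
```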
Thanks. Will add them there.
The reason for this is that calls into CUDA versions of operations are launched asynchronously. Only when you print a Tensor or copy it to the CPU can you be sure all operations have finished. Using synchronize here helps you make sure all the work has indeed finished and you're timing things correctly. Also see torch.cuda.
@zhangguanheng66 Could you share with us how your implementation performs compared to the PyTorch one after you have fixed the timing? Thanks.
```
MHA.out_proj.weight,
MHA.out_proj.bias,
attn_mask=torch_attn_mask)
print(time.monotonic() - t0)
```
Same comment here.
benchmark/mha_block.py (Outdated)
```
print(time.monotonic() - t0)

print("*" * 80)
print("test case GPU with embed_dim, nhead, tgt_len, src_len, bsz:", 768, 12, 128, 128, 72)
```
I believe most of the potential speed benefits from the MHA implemented in PyTorch are only valid when query = key = value (because it computes the projections in a single kernel launch for all three).
Can you add more benchmarks for different sizes in the query = key = value case? A for loop would be helpful there, something like:
```
for embed_dim in [256, 768]:
    for ...
        for ...
            print(...)
            _run_benchmark(...)
```
We've run benchmarks on this and it depends on the size of the inputs as well. For large inputs, as you can probably imagine, it shouldn't make much of a difference since the overhead disappears.
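One possible shape for such a sweep with query = key = value; the sizes and the _run_benchmark helper below are illustrative, and the timing includes the synchronization discussed above:

```python
import time
import torch

def _run_benchmark(mha, embed_dim, nhead, seq_len, bsz, device="cuda"):
    # Illustrative helper: one timed forward pass with query = key = value.
    query = torch.rand(seq_len, bsz, embed_dim, device=device)
    torch.cuda.synchronize()
    t0 = time.monotonic()
    mha(query, query, query)
    torch.cuda.synchronize()
    print("embed_dim, nhead, seq_len, bsz, time:",
          embed_dim, nhead, seq_len, bsz, time.monotonic() - t0)

for embed_dim in [256, 768]:
    for nhead in [4, 8]:
        for seq_len in [128, 256]:
            for bsz in [32, 72]:
                mha = torch.nn.MultiheadAttention(embed_dim, nhead).to("cuda")
                _run_benchmark(mha, embed_dim, nhead, seq_len, bsz)
```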
```
head_dim = v.size(-1) // self.nhead
v = v.reshape(src_len, bsz * self.nhead, head_dim)

attn_output, attn_output_weights = self.attention_layer(q, k, v, attn_mask=attn_mask,
```
It seems that for this container in particular there are no assumptions made on the dtype of attn_mask. I think we can relax that constraint. It stems from the fact that ScaledDotProduct needs a BoolTensor as a mask, but not for the container.
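One possible way to relax it, as a sketch (the helper below is illustrative, not from the PR): keep the container agnostic to the mask dtype and convert to a bool mask only where ScaledDotProduct needs it.

```python
import torch

def _to_bool_mask(attn_mask):
    # Illustrative: ScaledDotProduct masks positions where the entry is True,
    # so a non-bool (additive/float) mask is converted here by treating
    # nonzero entries as "masked". The container itself passes attn_mask through.
    if attn_mask is not None and attn_mask.dtype != torch.bool:
        attn_mask = attn_mask != 0
    return attn_mask
```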
```
# Dot product of q, k
attn_output_weights = torch.matmul(query, key.transpose(-2, -1))
if attn_mask is not None:
    attn_output_weights.masked_fill_(attn_mask, float('-inf'),)
```
I believe for some speech use case they needed to use -1e8 instead of -inf to avoid NaN: https://github.com/pytorch/fairseq/blob/928dc47e7e72f3e6ed96e50942e7fb8892cdcf32/fairseq/modules/transformer_layer.py#L108-L112
Does it make sense to have this be user configurable?
Thanks. I think I will follow the convention in fairseq. We could make this user configurable later.
Also, since this is part of ScaledDotProduct we can create variants of ScaledDotProduct that are more flexible for this kind of stuff. I think we'll end up with a small collection of attention functions and maybe we'll come up with some common building blocks there as well.
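As a sketch, such a variant could expose the fill value as a constructor argument (the class and argument names below are hypothetical, not part of this PR):

```python
import torch

class ConfigurableScaledDotProduct(torch.nn.Module):
    # Hypothetical attention variant: the value used to fill masked positions
    # (e.g. float('-inf') or -1e8) is user configurable.
    def __init__(self, dropout=0.0, mask_fill_value=-1e8):
        super().__init__()
        self.dropout = dropout
        self.mask_fill_value = mask_fill_value

    def forward(self, query, key, value, attn_mask=None):
        query = query / (query.size(-1) ** 0.5)
        attn_weights = torch.matmul(query, key.transpose(-2, -1))
        if attn_mask is not None:
            attn_weights.masked_fill_(attn_mask, self.mask_fill_value)
        attn_weights = torch.nn.functional.softmax(attn_weights, dim=-1)
        attn_weights = torch.nn.functional.dropout(attn_weights, p=self.dropout,
                                                   training=self.training)
        return torch.matmul(attn_weights, value), attn_weights
```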
@zhangguanheng66 In the docstrings, it seems
I have a PR to update the doc.

@zhangguanheng66 There seems to be some discrepancy compared to the PyTorch implementation, which has
In pytorch MHA,

@zhangguanheng66 Thanks. I've successfully built an MHA layer using your implementation. Its outputs numerically match those of the PyTorch implementation (I had to write an auxiliary function to convert the state dict of the latter to the former).
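For reference, a hedged sketch of such a conversion helper; the container-side key names (in_proj_container.query_proj/key_proj/value_proj and out_proj) are assumptions about this implementation:

```python
import torch

def convert_state_dict(torch_mha_state_dict):
    # nn.MultiheadAttention fuses the q/k/v projections into in_proj_weight /
    # in_proj_bias; the MHA container (assumed here) keeps three separate
    # Linear modules, so the fused tensors are split into three chunks.
    sd = torch_mha_state_dict
    q_w, k_w, v_w = sd["in_proj_weight"].chunk(3, dim=0)
    q_b, k_b, v_b = sd["in_proj_bias"].chunk(3, dim=0)
    return {
        "in_proj_container.query_proj.weight": q_w,
        "in_proj_container.query_proj.bias": q_b,
        "in_proj_container.key_proj.weight": k_w,
        "in_proj_container.key_proj.bias": k_b,
        "in_proj_container.value_proj.weight": v_w,
        "in_proj_container.value_proj.bias": v_b,
        "out_proj.weight": sd["out_proj.weight"],
        "out_proj.bias": sd["out_proj.bias"],
    }
```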
```
value = torch.cat([value, bias_v])
if attn_mask is not None:
    _attn_mask = attn_mask
    attn_mask = torch.nn.functional.pad(_attn_mask, (0, 1))
```
@zhangguanheng66 Why not simply attn_mask = torch.nn.functional.pad(attn_mask, (0, 1))?
Updated in the revised_mha PR (link).

@zhangguanheng66 I am trying to understand
Kind of. From the code's point of view, it pads an extra token onto the sequence dimension of key/value.
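As a shape illustration (all sizes below are made up), appending bias_k/bias_v adds one position to the source sequence, so the attention mask needs one extra column:

```python
import torch

src_len, tgt_len, bsz_x_nhead, head_dim = 21, 17, 64, 32
key = torch.rand(src_len, bsz_x_nhead, head_dim)
bias_k = torch.rand(1, bsz_x_nhead, head_dim)
key = torch.cat([key, bias_k])                            # (src_len + 1, bsz * nhead, head_dim)

attn_mask = torch.zeros(tgt_len, src_len, dtype=torch.bool)
attn_mask = torch.nn.functional.pad(attn_mask, (0, 1))    # (tgt_len, src_len + 1)
```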
```
# Dot product of q, k
attn_output_weights = torch.matmul(query, key.transpose(-2, -1))
if attn_mask is not None:
    attn_output_weights.masked_fill_(attn_mask, -1e8,)
```
@zhangguanheng66 To numerically match torch's implementation, this line should change to attn_output_weights.masked_fill_(attn_mask, float('-inf')).
@netw0rkf10w There are some ongoing discussions about NaN output for some special cases. We tried to avoid this when implementing MHA container in torchtext. I believe we will modify this accordingly as pytorch/pytorch#42323 concludes.
@zhangguanheng66 Great. Thanks for the information! I'll join that discussion later.
We propose to refactor the nn.MultiheadAttention module as an MHA container. The objective is to add more flexibility to try different MHA variants. The new MHA container is capable of:

- A drop-in replacement of nn.MultiheadAttention with the MHA container (initializing nn.MultiheadAttention versus the MHA container is sketched after this list). Note that attn_output_weights from the MHA container is output without averaging, so for the drop-in replacement users will need to average the attention output weights in order to get the same results as nn.MultiheadAttention.
- bias_k and bias_v, which are attached to the sequence dim of key/value.
- query/key/value with more than three dimensions. For example, for some CV applications the input tensors have four dimensions (N, H, W, C) (link).
- Custom projection modules. With the SharedQK_Proj class sketched after this list, we can drop a custom in-projection module into the MHA container. Another example is the relative attention implementation introduced in ref; the matrices for relative position distances are added to the attention layer (see Equation 4 in the reference).
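A minimal sketch of the drop-in usage and of a shared query/key in-projection, assuming the building-block names from this PR as exposed in torchtext.nn (MultiheadAttentionContainer, InProjContainer, ScaledDotProduct) and that the container returns per-head attention weights of shape (bsz * nhead, tgt_len, src_len):

```python
import torch
from torchtext.nn import InProjContainer, MultiheadAttentionContainer, ScaledDotProduct

embed_dim, nhead, bsz, seq_len = 768, 12, 8, 16

# nn.MultiheadAttention
torch_mha = torch.nn.MultiheadAttention(embed_dim, nhead)

# MHA container assembled from building blocks
in_proj = InProjContainer(torch.nn.Linear(embed_dim, embed_dim),
                          torch.nn.Linear(embed_dim, embed_dim),
                          torch.nn.Linear(embed_dim, embed_dim))
mha_container = MultiheadAttentionContainer(nhead, in_proj,
                                            ScaledDotProduct(),
                                            torch.nn.Linear(embed_dim, embed_dim))

query = torch.rand(seq_len, bsz, embed_dim)
attn_output, attn_weights = mha_container(query, query, query)
# Per-head weights; average over heads to compare with nn.MultiheadAttention.
avg_weights = attn_weights.view(bsz, nhead, seq_len, seq_len).mean(dim=1)


class SharedQK_Proj(torch.nn.Module):
    """Hypothetical in-projection sharing one Linear layer between query and key."""

    def __init__(self, embed_dim):
        super().__init__()
        self.qk_proj = torch.nn.Linear(embed_dim, embed_dim)
        self.v_proj = torch.nn.Linear(embed_dim, embed_dim)

    def forward(self, query, key, value):
        return self.qk_proj(query), self.qk_proj(key), self.v_proj(value)


# The shared projection drops into the container in place of InProjContainer.
shared_mha = MultiheadAttentionContainer(nhead, SharedQK_Proj(embed_dim),
                                         ScaledDotProduct(),
                                         torch.nn.Linear(embed_dim, embed_dim))
```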
Here is another example, adding normalization and dropout in the out-projection layer:
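A minimal sketch of such an out-projection (the class name and dropout value are illustrative); it can be passed as the out-projection argument of the MHA container:

```python
import torch

class NormDropoutOutProj(torch.nn.Module):
    # Illustrative out-projection: Linear followed by dropout and LayerNorm.
    def __init__(self, embed_dim, dropout=0.1):
        super().__init__()
        self.linear = torch.nn.Linear(embed_dim, embed_dim)
        self.dropout = torch.nn.Dropout(dropout)
        self.norm = torch.nn.LayerNorm(embed_dim)

    def forward(self, attn_output):
        return self.norm(self.dropout(self.linear(attn_output)))
```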