while MultiHeadAttention should support two types of masks:
- A padding mask, which masks out padded positions (the padding mask)
- A mask that prevents certain positions from attending to others, e.g. a causal / look-ahead mask (the attention mask)

However, the current implementation seems to support only the first of these, the padding mask.
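For comparison, PyTorch's `torch.nn.MultiheadAttention` takes the two masks as separate arguments (`key_padding_mask` and `attn_mask`). The sketch below builds both kinds of masks and also shows how they could be folded into a single combined mask if a layer only accepts one; the batch size, sequence lengths, and dimensions are illustrative assumptions, not taken from this issue.

```python
import torch

batch, seq_len, d_model = 2, 5, 8
lengths = torch.tensor([5, 3])  # hypothetical per-example lengths

# Padding mask: True where a key position is padding and must be ignored.
# Shape (batch, seq_len).
key_padding_mask = torch.arange(seq_len)[None, :] >= lengths[:, None]

# Attention mask: True where a query may NOT attend to a key, here a
# causal / look-ahead mask. Shape (seq_len, seq_len).
attn_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

mha = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=2, batch_first=True)
x = torch.randn(batch, seq_len, d_model)
out, _ = mha(x, x, x, key_padding_mask=key_padding_mask, attn_mask=attn_mask)

# An implementation that accepts only a single mask can still express both by
# combining them into one boolean mask of shape (batch, seq_len, seq_len),
# where True marks a blocked query/key pair (PyTorch's convention):
combined_mask = attn_mask[None, :, :] | key_padding_mask[:, None, :]
```

A single per-example mask of shape `(batch, query_len, key_len)` like `combined_mask` above is the usual way single-mask APIs cover both cases, since any combination of padding and position-blocking rules can be expressed in it.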