I'm converting a TensorFlow transformer model to its PyTorch equivalent. In the TF code, the multi-head attention part is: att = layers.MultiHeadAttention(num_heads=6, key_dim=4), and the input shape is [None, 136, 4], where None is the batch size, 136 is the sequence length, and 4 is the embedding dimension. num_heads is the number of heads and key_dim is the dimension of each head. It is called as: att(query=input, value=input)
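Here is a minimal, self-contained sketch of that setup (shapes and values taken from the code above; the batch size of 2 is just for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

att = layers.MultiHeadAttention(num_heads=6, key_dim=4)

x = tf.random.normal((2, 136, 4))   # [batch, seq_len, embed_dim]
out = att(query=x, value=x)         # self-attention; key defaults to value
print(out.shape)                    # (2, 136, 4): output dim defaults to the query's last dim
```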

In PyTorch, the MHA is defined as att = nn.MultiheadAttention(embed_dim, num_heads), where embed_dim is the dimension of the model (not the dimension of each head) and must be divisible by num_heads = 6. Since the input shape is [None, 136, 4] and 4 is not divisible by 6, PyTorch raises an error about divisibility. How should I change my input to be able to use PyTorch instead of TF?
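A minimal sketch of what I mean (the exact error type and message may vary slightly between PyTorch versions):

```python
import torch.nn as nn

# nn.MultiheadAttention splits embed_dim across the heads, so embed_dim % num_heads
# must be 0. With embed_dim=4 and num_heads=6 this fails at construction time.
try:
    att = nn.MultiheadAttention(embed_dim=4, num_heads=6, batch_first=True)
except AssertionError as e:
    print(e)   # embed_dim must be divisible by num_heads
```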

If key_dim in TF is 4 and there are 6 heads, the whole model's dimension should be 4 * 6 = 24. I set embed_dim in PyTorch to 24, but because the input size is [None, 136, 4], PyTorch raises a dimension error saying "expected input of size 24 but got 4". Is it OK to repeat my input num_heads times to fix the problem? Does TF feed the input directly to each head without dividing it among the heads? How can I convert the TF MHA to PyTorch MHA?
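One workaround I've been considering is to project the 4-dim input up to num_heads * key_dim = 24, run nn.MultiheadAttention with embed_dim=24, and project back down to 4. Roughly like this sketch (the class and attribute names here are mine, just for illustration):

```python
import torch
import torch.nn as nn

class ProjectedMHA(nn.Module):
    """Mimics the TF layer by projecting the input up to a divisible dimension."""
    def __init__(self, input_dim=4, num_heads=6, key_dim=4):
        super().__init__()
        inner_dim = num_heads * key_dim                 # 24, divisible by num_heads
        self.in_proj = nn.Linear(input_dim, inner_dim)
        self.mha = nn.MultiheadAttention(inner_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(inner_dim, input_dim)

    def forward(self, x):                               # x: [B, 136, 4]
        h = self.in_proj(x)                             # [B, 136, 24]
        h, _ = self.mha(h, h, h)                        # self-attention
        return self.out_proj(h)                         # back to [B, 136, 4]

x = torch.randn(2, 136, 4)
print(ProjectedMHA()(x).shape)                          # torch.Size([2, 136, 4])
```

This keeps embed_dim divisible by num_heads while the model's external width stays 4, but I'm not sure it matches TF's internal parameterization exactly.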

  • Here is the opposite direction (it may give some pointers): gist.github.com/innat/e88b096390985e806299b4a3dccc5118 Commented Sep 28, 2023 at 11:52
  • It seems TensorFlow lets you either feed the same input directly to each head or divide it dimension-wise among the heads, but PyTorch doesn't offer this option (or I couldn't figure it out), so I wrote my own MHA in PyTorch the way TF designed it; a sketch of that approach follows this list. However, thank you for responding. Commented Sep 29, 2023 at 21:55
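A minimal sketch of that kind of hand-rolled, TF-style MHA in PyTorch (each head projects the full 4-dim input to key_dim, so there is no divisibility constraint; the names are illustrative, not the actual code from the comment, and masking/dropout are omitted):

```python
import math
import torch
import torch.nn as nn

class KerasLikeMHA(nn.Module):
    """Follows the TF parameterization: every head sees the full input,
    projected to key_dim, so input_dim need not be divisible by num_heads."""
    def __init__(self, input_dim=4, num_heads=6, key_dim=4):
        super().__init__()
        self.h, self.d = num_heads, key_dim
        self.q = nn.Linear(input_dim, num_heads * key_dim)
        self.k = nn.Linear(input_dim, num_heads * key_dim)
        self.v = nn.Linear(input_dim, num_heads * key_dim)
        self.out = nn.Linear(num_heads * key_dim, input_dim)

    def forward(self, x):                               # x: [B, L, input_dim]
        B, L, _ = x.shape
        def split(t):                                   # [B, L, h*d] -> [B, h, L, d]
            return t.view(B, L, self.h, self.d).transpose(1, 2)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d)   # [B, h, L, L]
        ctx = scores.softmax(dim=-1) @ v                        # [B, h, L, d]
        ctx = ctx.transpose(1, 2).reshape(B, L, self.h * self.d)
        return self.out(ctx)                            # back to [B, L, input_dim]

x = torch.randn(2, 136, 4)
print(KerasLikeMHA()(x).shape)                          # torch.Size([2, 136, 4])
```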
