
In transformer models, I've noticed that token embeddings and positional embeddings are added together before being passed into the attention layers:

import torch
import torch.nn as nn

class TransformerModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, context_length, dropout_rate):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(context_length, emb_dim)
        self.drop_emb = nn.Dropout(dropout_rate)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        # Token and positional embeddings are added element-wise to get a
        # combined representation of:
        #   - what the token is (meaning)
        #   - where it is (position)
        # Example: "The cat sat."
        #   tok_emb["The"] gives the concept of "the"
        #   pos_emb[0] says it is at position 0, i.e. the first word
        # Combined, the model knows it's "the" at the start of the sentence.
        tok_embeds = self.tok_emb(in_idx)                                       # (batch_size, seq_len, emb_dim)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))  # (seq_len, emb_dim)
        x = tok_embeds + pos_embeds                                             # broadcasts over the batch dimension
        x = self.drop_emb(x)
        return x

But both are just vectors, and addition is element-wise, so how does this preserve both the word's identity and its position? Why does simple addition work so well in practice?
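For concreteness, here is a minimal shape check of just the addition step (the hyperparameters are arbitrary toy values, not the ones from my model): pos_embeds has shape (seq_len, emb_dim) and broadcasts across the batch dimension when added to tok_embeds, which has shape (batch_size, seq_len, emb_dim).

import torch
import torch.nn as nn

vocab_size, emb_dim, context_length = 100, 8, 16     # arbitrary toy sizes
tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

in_idx = torch.randint(0, vocab_size, (2, 5))        # (batch_size=2, seq_len=5)
tok_embeds = tok_emb(in_idx)                         # (2, 5, 8)
pos_embeds = pos_emb(torch.arange(in_idx.shape[1]))  # (5, 8)
x = tok_embeds + pos_embeds                          # broadcasts to (2, 5, 8)
print(tok_embeds.shape, pos_embeds.shape, x.shape)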

  • The model learns to differentiate between the token embeddings and the positional embeddings, much as it learns to differentiate between "castle" and "stone castle": the representation of "stone castle" is roughly the sum of the embeddings for "stone" and "castle" (not exactly, but you get the idea). Commented May 27 at 2:30
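To make that intuition a bit more concrete, here is a rough numerical sketch (my own illustration, using random vectors as stand-ins for learned embeddings, so it is only suggestive): in a high-dimensional space, two independently drawn vectors are nearly orthogonal, so their sum still lets you approximately recover each part by projection. Learned embeddings are not random, but the model has the same room to keep token and position information in roughly non-interfering directions.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb_dim = 768                # a typical embedding width
tok = torch.randn(emb_dim)   # stand-in for a token embedding
pos = torch.randn(emb_dim)   # stand-in for a positional embedding

# Nearly orthogonal: cosine similarity is close to 0 in high dimensions.
print(F.cosine_similarity(tok, pos, dim=0).item())

# Projecting the sum onto pos mostly recovers pos, because tok·pos ≈ 0.
s = tok + pos
print((s @ pos / (pos @ pos)).item())   # close to 1.0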
