In transformer models, I've noticed that token embeddings and positional embeddings are added together before being passed into the attention layers:
import torch
import torch.nn as nn


class TransformerModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, context_length, dropout_rate):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(context_length, emb_dim)
        self.drop_emb = nn.Dropout(dropout_rate)

    def forward(self, in_idx):
        '''
        Token and positional embeddings are added element-wise to get a combined
        representation of:
        - What the token is (meaning)
        - Where it is (position)
        For example, in "The cat sat.":
        tok_emb["The"] gives the concept of "the"
        pos_emb[0] says it is the first word
        Combined: the model knows it's "the" at the start of the sentence.
        '''
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)                                        # (batch_size, seq_len, emb_dim)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))   # (seq_len, emb_dim)
        x = tok_embeds + pos_embeds                                              # broadcasts to (batch_size, seq_len, emb_dim)
        x = self.drop_emb(x)                                                     # dropout on the combined embeddings
        return x
But both are just vectors, and the addition is element-wise, so how does this preserve both the word's identity and its position? Why does simple addition work so well in practice?
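To make concrete what I mean, here is a minimal standalone sketch (the toy sizes and token indices are just values I picked for illustration) showing the shapes involved and how the position vectors are broadcast-added onto every token vector in the batch:

import torch
import torch.nn as nn

torch.manual_seed(0)

# toy sizes, chosen only for illustration
vocab_size, emb_dim, context_length = 10, 4, 8

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

in_idx = torch.tensor([[1, 5, 2]])       # batch of 1, sequence of 3 token ids
tok_embeds = tok_emb(in_idx)             # shape (1, 3, 4): one vector per token
pos_embeds = pos_emb(torch.arange(3))    # shape (3, 4): one vector per position
x = tok_embeds + pos_embeds              # broadcasts to (1, 3, 4)

print(tok_embeds.shape, pos_embeds.shape, x.shape)

Each output element is just tok_embeds[b, t, d] + pos_embeds[t, d], so the token and position information end up sharing the same coordinates rather than occupying separate ones, which is exactly the part that puzzles me.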