In transformer models, I've noticed that token embeddings and positional embeddings are added together before being passed into the attention layers:
import torch
import torch.nn as nn


class TransformerModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, context_length, dropout_rate):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(context_length, emb_dim)
        self.drop_emb = nn.Dropout(dropout_rate)

    def forward(self, in_idx):
        '''
        Token and positional embeddings are added element-wise to get a combined
        representation of:
        - What the token is (meaning)
        - Where it is (position)
        For example, in "The cat sat.":
        tok_emb["The"] gives the concept of "the"
        pos_emb[0] says it is the first word
        Combined: the model knows it's "the" at the start of the sentence.
        '''
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)                                        # (batch_size, seq_len, emb_dim)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))   # (seq_len, emb_dim)
        x = tok_embeds + pos_embeds                                              # broadcasts to (batch_size, seq_len, emb_dim)
        x = self.drop_emb(x)                                                     # dropout on the combined embeddings
        return x
But both are just vectors, and the addition is element-wise, so how does this preserve both the word's identity and its position? Why does simple addition work so well in practice?
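To make concrete what I mean, here is a minimal standalone sketch (the toy sizes and token indices are just values I picked for illustration) showing the shapes involved and how the position vectors are broadcast-added onto every token vector in the batch:

import torch
import torch.nn as nn

torch.manual_seed(0)

# toy sizes, chosen only for illustration
vocab_size, emb_dim, context_length = 10, 4, 8

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

in_idx = torch.tensor([[1, 5, 2]])       # batch of 1, sequence of 3 token ids
tok_embeds = tok_emb(in_idx)             # shape (1, 3, 4): one vector per token
pos_embeds = pos_emb(torch.arange(3))    # shape (3, 4): one vector per position
x = tok_embeds + pos_embeds              # broadcasts to (1, 3, 4)

print(tok_embeds.shape, pos_embeds.shape, x.shape)

Each output element is just tok_embeds[b, t, d] + pos_embeds[t, d], so the token and position information end up sharing the same coordinates rather than occupying separate ones, which is exactly the part that puzzles me.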