I am working on a problem in which I want to train a Transformer Encoder Layer directly (i.e. with no embedding layer). I already have the sequences of embeddings that I will treat as my dataset. I am confused about how I should handle the padding and the attention mask and would simply like to make sure that my understanding is correct.
My sequences have lengths varying from as few as 3 to as many as 130 parts. Does this mean that I should pad all my sequences to 130 parts? If so, does it matter which value I pad with?
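In case it helps, this is roughly how I was planning to build the padded batch, using pad_sequence with zero-padding (just a sketch with made-up lengths, assuming my embeddings are 768-dimensional):

import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical batch of variable-length sequences of 768-dim embeddings;
# my real lengths range from 3 to 130.
seqs = [torch.randn(3, 768), torch.randn(7, 768), torch.randn(130, 768)]

# pad_sequence pads everything to the longest length in the batch (130 here),
# filling the extra positions with 0.0 by default.
padded = pad_sequence(seqs, batch_first=True)       # shape: (3, 130, 768)
lengths = torch.tensor([s.size(0) for s in seqs])   # tensor([  3,   7, 130])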
For the attention mask, I believe that I want each part to attend to all other parts in the sequence. In the docs I see that they set it up so that each part is only allowed to attend to earlier parts in the sequence. Is this the most natural approach, or is it just for the language modeling task? Also, why (in the same link) do they use -Inf and 0 as the values of the attention mask rather than simply 1s and 0s?
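Just to show how I currently picture the -Inf/0 convention: my understanding is that this kind of mask is added to the raw attention scores before the softmax, so masked entries end up with zero weight. A toy sketch of that idea (not the actual library internals):

import torch

sz = 3
scores = torch.randn(sz, sz)                        # made-up attention scores for 3 positions
# Causal-style additive mask: 0 on/below the diagonal, -inf above it.
causal = torch.full((sz, sz), float("-inf")).triu(diagonal=1)
weights = torch.softmax(scores + causal, dim=-1)
print(weights)                                      # entries above the diagonal are exactly 0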
As a little toy example, say that I have two samples in my dataset with sequence lengths of 2 and 3 respectively (so 3 is the max):
s_1 = torch.Tensor([[0.001, 0.002, ..., 0.768], [0.001, 0.002, ..., 0.768]])  # length 2
s_2 = torch.Tensor([[0.001, 0.002, ..., 0.768], [0.001, 0.002, ..., 0.768], [0.001, 0.002, ..., 0.768]])  # length 3
Does this mean that I should then pad s_1 to have length 3? And do something like:
s_1 = torch.Tensor([[0.001, 0.002, ..., 0.768], [0.001, 0.002, ..., 0.768], [0, 0, ..., 0]])
And then my attention masks would look like:
attn_mask_s1 = [[0,    -Inf, 0   ],
                [-Inf, 0,    0   ],
                [0,    0,    0   ]]

attn_mask_s2 = [[0,    -Inf, -Inf],
                [-Inf, 0,    -Inf],
                [-Inf, -Inf, 0   ]]
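Putting it together, this is the call I think I am aiming for, with a boolean src_key_padding_mask instead of the per-sample float masks above (again just a sketch; nhead=8 and num_layers=2 are made-up hyperparameters):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

d_model = 768                                       # my embedding size
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Toy batch matching the example above: lengths 2 and 3, zero-padded to 3.
s_1 = torch.randn(2, d_model)
s_2 = torch.randn(3, d_model)
batch = pad_sequence([s_1, s_2], batch_first=True)  # (2, 3, 768)

# True marks the padded positions that attention should ignore.
lengths = torch.tensor([2, 3])
padding_mask = torch.arange(batch.size(1)).unsqueeze(0) >= lengths.unsqueeze(1)  # (2, 3) bool

out = encoder(batch, src_key_padding_mask=padding_mask)  # (2, 3, 768)

My understanding is that this boolean key-padding mask plays the same role as putting -Inf in the columns of the padded positions, but I am not sure that is right.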
Sorry to pack so many questions into one post, but they all come down to my uncertainty about how data should be passed to the TransformerEncoder block.