I am working on a problem in which I want to train a Transformer Encoder Layer directly (i.e. with no embedding layer). I already have the sequences of embeddings that I will treat as my dataset. I am confused about how I should handle the padding and the attention mask and would simply like to make sure that my understanding is correct.
My sequences have lengths varying from as few as 3 to as many as 130 parts. Does this mean that I should pad all my sequences to 130 parts? If so, does it matter which value I pad with?
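In case it helps, this is roughly how I was planning to build the padded batch, using pad_sequence with zero-padding (just a sketch with made-up lengths, assuming my embeddings are 768-dimensional):

import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical batch of variable-length sequences of 768-dim embeddings;
# my real lengths range from 3 to 130.
seqs = [torch.randn(3, 768), torch.randn(7, 768), torch.randn(130, 768)]

# pad_sequence pads everything to the longest length in the batch (130 here),
# filling the extra positions with 0.0 by default.
padded = pad_sequence(seqs, batch_first=True)       # shape: (3, 130, 768)
lengths = torch.tensor([s.size(0) for s in seqs])   # tensor([  3,   7, 130])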
For the attention mask, I believe that I want each part to attend to all other parts in the sequence. In the docs I see that they set it up so that each part is only allowed to attend to earlier parts in the sequence. Is this the most natural approach, or is it just for the language modeling task? Also, why (in the same link) do they use -Inf and 0 as the values of the attention mask rather than simply 1s and 0s?
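Just to show how I currently picture the -Inf/0 convention: my understanding is that this kind of mask is added to the raw attention scores before the softmax, so masked entries end up with zero weight. A toy sketch of that idea (not the actual library internals):

import torch

sz = 3
scores = torch.randn(sz, sz)                        # made-up attention scores for 3 positions
# Causal-style additive mask: 0 on/below the diagonal, -inf above it.
causal = torch.full((sz, sz), float("-inf")).triu(diagonal=1)
weights = torch.softmax(scores + causal, dim=-1)
print(weights)                                      # entries above the diagonal are exactly 0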
As a little toy example, say that I have two samples in my dataset with sequence lengths of 2 and 3 respectively (so 3 is the max):
s_1 = torch.Tensor([[0.001, 0.002, ..., 0.768], [0.001, 0.002, ..., 0.768]])  # length 2
s_2 = torch.Tensor([[0.001, 0.002, ..., 0.768], [0.001, 0.002, ..., 0.768], [0.001, 0.002, ..., 0.768]])  # length 3
Does this mean that I should then pad s_1 to have length 3? And do something like:
s_1 = torch.Tensor([[0.001, 0.002, ..., 0.768], [0.001, 0.002, ..., 0.768], [0, 0, ..., 0]])
And then my attention masks would look like:
attn_mask_s1 = [[0,    -Inf, 0   ],
                [-Inf, 0,    0   ],
                [0,    0,    0   ]]

attn_mask_s2 = [[0,    -Inf, -Inf],
                [-Inf, 0,    -Inf],
                [-Inf, -Inf, 0   ]]
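Putting it together, this is the call I think I am aiming for, with a boolean src_key_padding_mask instead of the per-sample float masks above (again just a sketch; nhead=8 and num_layers=2 are made-up hyperparameters):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

d_model = 768                                       # my embedding size
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Toy batch matching the example above: lengths 2 and 3, zero-padded to 3.
s_1 = torch.randn(2, d_model)
s_2 = torch.randn(3, d_model)
batch = pad_sequence([s_1, s_2], batch_first=True)  # (2, 3, 768)

# True marks the padded positions that attention should ignore.
lengths = torch.tensor([2, 3])
padding_mask = torch.arange(batch.size(1)).unsqueeze(0) >= lengths.unsqueeze(1)  # (2, 3) bool

out = encoder(batch, src_key_padding_mask=padding_mask)  # (2, 3, 768)

My understanding is that this boolean key-padding mask plays the same role as putting -Inf in the columns of the padded positions, but I am not sure that is right.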
Sorry to pack so many questions into one post, but they all come down to my uncertainty about how data should be passed to the TransformerEncoder block.