While reading this tutorial on how to implement an Encoder/Decoder Transformer, I had some doubts about the training process. Specifically, as described in the original paper, the decoder should iteratively consume its own output from the previous step as its next input. However, the training step is implemented as:
```python
@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
    # Decoder input: target without its last token; labels: target shifted left by one.
    tar_inp = tar[:, :-1]
    tar_real = tar[:, 1:]

    with tf.GradientTape() as tape:
        # Single forward pass over the whole target sequence.
        predictions, _ = transformer([inp, tar_inp], training=True)
        loss = loss_function(tar_real, predictions)

    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
```
Here `tar_inp` is the tokenized target sentence without its final (EOS) token, and `tar_real` is the same sentence shifted left by one position (i.e. without its start token), so at every position the model is asked to predict the next ground-truth token.
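As a concrete illustration of that slicing (the token ids below are made up, just for the example):

```python
import tensorflow as tf

# Toy tokenized target sentence: [START, "the", "cat", "sat", EOS]
tar = tf.constant([[1, 7, 12, 25, 2]])   # 1 = START, 2 = EOS in this toy example

tar_inp = tar[:, :-1]   # [[1, 7, 12, 25]] -> fed to the decoder
tar_real = tar[:, 1:]   # [[7, 12, 25, 2]] -> what the model should predict

# Assuming the usual look-ahead (causal) mask, the prediction at position i
# can only attend to tar_inp[:, :i+1] and is compared against tar_real[:, i].
```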
However, I would have expected the target input (the decoder input) to be built up iteratively, either by concatenating the previous predictions or, with teacher forcing, by feeding one additional ground-truth token at a time, roughly as in the sketch below.
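Something along these lines is what I had in mind (this is not the tutorial's code; `transformer`, `inp`, `tar`, `start_token_id` and `max_length` are just placeholders):

```python
import tensorflow as tf

# NOT the tutorial's code -- just the iterative loop I expected during training,
# where the decoder input grows by one token per step.
output = tf.expand_dims([start_token_id], 0)                 # start with [START]

for i in range(max_length):
    predictions, _ = transformer([inp, output], training=True)
    next_token = tf.argmax(predictions[:, -1:, :], axis=-1)  # model's own guess
    # Teacher-forced variant: append the next ground-truth token instead.
    # next_token = tar[:, i + 1:i + 2]
    output = tf.concat([output, tf.cast(next_token, output.dtype)], axis=-1)
```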
Why is it not done that way?