
I am working on a new PyTorch model which takes sequential data as input, and I need it to output just a single value, which I will then evaluate with a binary cross-entropy loss as the probability of 1 or 0.

To be more concrete, let's say my sequence is 1000 time steps and only 2 dimensions, like a 2-dimensional sine wave, so the data shape would be 1000 x 2.

I have done something like this before using an RNN, for which there is a lot of content online. Because of the recurrent structure of the RNN, we just look at the final output of the RNN after it has processed the whole sequence. That final step's output is 2-dimensional, so we can apply a linear layer to convert 2 -> 1 dimension, et voilà, it's done.
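For reference, here is roughly what I mean by that; this is just a minimal sketch, and the layer sizes, batch size, and names are illustrative:

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, input_dim=2, hidden_dim=2):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)      # 2 -> 1

    def forward(self, x):                       # x: (batch, 1000, 2)
        out, _ = self.rnn(x)                    # out: (batch, 1000, hidden_dim)
        last = out[:, -1, :]                    # final time step only
        return self.fc(last)                    # (batch, 1) logit

model = RNNClassifier()
logits = model(torch.randn(4, 1000, 2))         # (4, 1)
loss = nn.BCEWithLogitsLoss()(logits, torch.ones(4, 1))
```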

MY PROBLEM:

What I am attempting to do now does not use a recurrent network, but instead an encoder with attention (a Transformer). So the output of the encoder is still 1000 steps long by whatever my embedding dimension is, let's say 8. So the output of the sequential encoder has shape 1000 x 8. My issue is that I need to convert this output to a single value, to which I can apply the binary cross-entropy function, and I am not finding an obvious way to do this.

IDEAS:

Traditionally with this kind of sequential model, the encoder feeds into a decoder, and the decoder can then output a variable-length sequence (this is used for language-translation problems). My problem is different in that I don't want to output another sequence, just a single value. Maybe I need to adapt the decoder so that this works? The decoder usually takes a target as well as the output from the encoder as input, and the decoder's output then has the same shape as this target. One idea would be to use the traditional decoder and give it a length-1 target; I would then get a length-1 output and could use a standard linear layer to convert this to my desired output. However, this doesn't seem entirely logical, because I really am not interested in outputting a sequence, just one value.
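To make that idea concrete, here is a rough sketch of what I am imagining with nn.Transformer and a single learnable length-1 "target" token; everything here (dimensions, head counts, the query parameter) is just my own guess:

```python
import torch
import torch.nn as nn

class Seq2OneTransformer(nn.Module):
    def __init__(self, d_model=8, nhead=2):
        super().__init__()
        self.input_proj = nn.Linear(2, d_model)                  # 2-dim input -> d_model
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, d_model))    # the length-1 "target"
        self.fc = nn.Linear(d_model, 1)

    def forward(self, x):                                 # x: (batch, 1000, 2)
        src = self.input_proj(x)                          # (batch, 1000, d_model)
        tgt = self.query.expand(x.size(0), -1, -1)        # (batch, 1, d_model)
        out = self.transformer(src, tgt)                  # (batch, 1, d_model)
        return self.fc(out[:, 0, :])                      # (batch, 1) logit
```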

Anyway, just looking for some more ideas from the community, if you have any. Thanks!

  • I believe you need to step back from implementation (i.e. drop the pytorch part) -- and then it's more of a stats.stackexchange.com question ;-) Commented Dec 30, 2020 at 13:38

1 Answer


I think this paper does what you want :) (It's probably not the first paper that does this, but it is the one I read recently.)

  1. Prepend an extra token to your sequence. The token can have a learnable embedding.
  2. After the transformer, discard (or don't compute) the output at the other positions. Take only the output from the first position and transform it to the target that you need.

[Image taken from the paper]
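A minimal sketch of that idea with a plain nn.TransformerEncoder; the layer counts, head count, and the cls_token name are my own illustrative choices, not something prescribed by the paper:

```python
import torch
import torch.nn as nn

class TransformerClassifier(nn.Module):
    def __init__(self, input_dim=2, d_model=8, nhead=2, num_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, d_model)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))   # learnable prepended token
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.fc = nn.Linear(d_model, 1)

    def forward(self, x):                        # x: (batch, 1000, 2)
        h = self.input_proj(x)                   # (batch, 1000, d_model)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        h = torch.cat([cls, h], dim=1)           # (batch, 1001, d_model)
        h = self.encoder(h)                      # (batch, 1001, d_model)
        return self.fc(h[:, 0, :])               # logit from the prepended position

model = TransformerClassifier()
logits = model(torch.randn(4, 1000, 2))          # (4, 1)
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(4, 1))
```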


1 Comment

After posting I kept mulling it over, and it does seem completely reasonable to just predict the "first position" of the output sequence like this. You still get to take advantage of the attention mechanism and everything. I am going to try it.
