
I'm testing out different implementations of an LSTM autoencoder for anomaly detection on 2D inputs. My question is not about the code itself but about understanding the underlying behavior of each network.

Both implementations have the same number of units (16). Model 2 is a "typical" seq-to-seq autoencoder, with the last output of the encoder repeated "n" times to match the input length of the decoder. I'd like to understand why Model 1 seems to easily outperform Model 2, and why Model 2 isn't able to do better than predicting the mean.

Model 1:

from tensorflow.keras import Model, layers

class LSTM_Detector(Model):
  def __init__(self, flight_len, param_len, hidden_state=16):
    super(LSTM_Detector, self).__init__()
    self.input_dim = (flight_len, param_len)
    self.units = hidden_state
    self.encoder = layers.LSTM(self.units,
                  return_state=True,
                  return_sequences=True,
                  activation="tanh",
                  name='encoder',
                  input_shape=self.input_dim)
    
    self.decoder = layers.LSTM(self.units,
                  return_sequences=True,
                  activation="tanh",
                  name="decoder",
                  input_shape=(self.input_dim[0],self.units))
    
    self.dense = layers.TimeDistributed(layers.Dense(self.input_dim[1]))
    
  def call(self, x):
    # The encoder returns its full output sequence (flight_len, units) plus the final states.
    output, hs, cs = self.encoder(x)
    encoded_state = [hs, cs] # see https://www.tensorflow.org/guide/keras/rnn
    # The decoder receives the encoder's output at every time step.
    decoded = self.decoder(output, initial_state=encoded_state)
    output_decoder = self.dense(decoded)

    return output_decoder

Model 2:

class Seq2Seq_Detector(Model):
  def __init__(self, flight_len, param_len, hidden_state=16):
    super(Seq2Seq_Detector, self).__init__()
    self.input_dim = (flight_len, param_len)
    self.units = hidden_state
    self.encoder = layers.LSTM(self.units,
                  return_state=True,
                  return_sequences=False,
                  activation="tanh",
                  name='encoder',
                  input_shape=self.input_dim)
    
    self.repeat = layers.RepeatVector(self.input_dim[0])
    
    self.decoder = layers.LSTM(self.units,
                  return_sequences=True,
                  activation="tanh",
                  name="decoder",
                  input_shape=(self.input_dim[0],self.units))
    
    self.dense = layers.TimeDistributed(layers.Dense(self.input_dim[1]))
    
  def call(self, x):
    # The encoder returns only its final output of shape (batch, units) plus the final states.
    output, hs, cs = self.encoder(x)
    encoded_state = [hs, cs] # see https://www.tensorflow.org/guide/keras/rnn
    # That single vector is repeated flight_len times to form the decoder input.
    repeated_vec = self.repeat(output)
    decoded = self.decoder(repeated_vec, initial_state=encoded_state)
    output_decoder = self.dense(decoded)

    return output_decoder

I fitted these two models for 200 epochs on a sample of data of shape (89, 1500, 77), each input being a 2D array of shape (1500, 77), with test data of shape (10, 1500, 77). Both models had only 16 units.
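
For context, a minimal training sketch matching those shapes might look like the following; the optimizer, loss, batch size, and the random placeholder arrays are my assumptions, not details from the question:

import numpy as np

# Placeholder arrays with the shapes described above (random data, for illustration only).
x_train = np.random.rand(89, 1500, 77).astype("float32")   # (samples, time steps, features)
x_test  = np.random.rand(10, 1500, 77).astype("float32")

model = LSTM_Detector(flight_len=1500, param_len=77, hidden_state=16)

# An autoencoder is trained to reconstruct its own input, so the input is also the target.
model.compile(optimizer="adam", loss="mse")   # assumed optimizer/loss
model.fit(x_train, x_train, epochs=200, batch_size=8, validation_data=(x_test, x_test))

# The per-sample reconstruction error can then serve as an anomaly score.
reconstruction = model.predict(x_test)
scores = np.mean((x_test - reconstruction) ** 2, axis=(1, 2))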

Here are the results of the autoencoders on one feature of the test data.

Results Model 1: (black line is the ground truth, red is the reconstruction)

[figure: Model 1 reconstruction vs. ground truth on one test feature]

Results Model 2:

[figure: Model 2 reconstruction vs. ground truth on one test feature]

I understand the second one is more restrictive, since all the information from the input sequence is compressed into one step, but I'm still surprised that it's barely able to do better than predicting the average.

On the other hand, I feel Model 1 tends to be more "influenced" by new data without actually reconstructing the input. See the example below of Model 1 given a flat line as input:

[figure: Model 1 reconstruction for a flat-line input]

PS: I know it's not a lot of data for this kind of model; I have much more available, but at this stage I'm just experimenting and trying to build my understanding.

PS 2: Neither model overfitted its data, and the training and validation curves are almost textbook-like.

Why is there such a gap in terms of behavior?

  • How did you split the train/test sets? Randomly, or based on the time series? Commented Dec 22, 2020 at 5:14
  • @jonnor the split respects the sequence: the data is sorted, then the training set is taken from the first 90% and the test set from the remaining 10%. Commented Dec 22, 2020 at 13:15

1 Answer


In Model 1, each time step of 77 features is compressed and decompressed this way: 77 -> 16 -> 16 -> 77, plus some information from the previous steps. It seems that replacing the LSTMs with just TimeDistributed(Dense(...)) might also work in this case, but I can't say for sure as I don't know the data. The third image might become better.
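
For what it's worth, a minimal sketch of that per-time-step baseline might look like this (the layer sizes and activation are illustrative assumptions):

    from tensorflow.keras import Sequential, layers

    # Each of the 1500 time steps is compressed 77 -> 16 -> 77 independently,
    # with no recurrence at all.
    baseline = Sequential([
        layers.Input(shape=(1500, 77)),
        layers.TimeDistributed(layers.Dense(16, activation="tanh")),
        layers.TimeDistributed(layers.Dense(77)),
    ])
    baseline.compile(optimizer="adam", loss="mse")

If this baseline reconstructs the data about as well as Model 1, it suggests the LSTM in Model 1 is acting mostly as a per-step compressor.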

What Model 2 predicts is what usually happens when there is no useful signal in the input: the best thing the model can do (well, is optimized to do) is just predict the mean target value of the training set.

In model 2 you have:

...
    self.encoder = layers.LSTM(self.units,
                  return_state=True,
                  return_sequences=False,
...

and then

    self.repeat = layers.RepeatVector(self.input_dim[0])

So, in fact, when it does

    repeated_vec = self.repeat(output)
    decoded = self.decoder(repeated_vec, initial_state=encoded_state)

it takes only the last output from the encoder (which in this case represents the last of the 1500 steps), copies it 1500 times (input_dim[0]), and tries to predict all 1500 values from information about only the last few of them. This is where the model loses most of the useful signal. It does not have enough (or any) information about the earlier parts of the input, and the best thing it can learn in order to minimize the loss function (which I suppose in this case is MSE or MAE) is to predict the mean value for each of the features.
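
You can see the bottleneck directly from the tensor shapes; here is a small check with dummy data (the shapes simply follow the dimensions quoted in the question):

    import numpy as np
    from tensorflow.keras import layers

    x = np.random.rand(1, 1500, 77).astype("float32")   # one flight

    encoder = layers.LSTM(16, return_state=True, return_sequences=False)
    output, hs, cs = encoder(x)
    print(output.shape)    # (1, 16): 1500 steps of 77 features squeezed into 16 numbers

    repeated = layers.RepeatVector(1500)(output)
    print(repeated.shape)  # (1, 1500, 16): the same 16 numbers copied 1500 times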

Also, a seq-to-seq model usually passes the prediction of one decoder step as the input to the next decoder step; in the current case, the decoder input is always the same repeated vector.
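
For comparison, a rough sketch of that autoregressive pattern, built here on LSTMCell with a hypothetical autoregressive_decode helper (this is not the code from the question, and teacher forcing during training is omitted for brevity):

    import tensorflow as tf
    from tensorflow.keras import layers

    units, steps, n_features = 16, 1500, 77
    cell = layers.LSTMCell(units)          # one decoder step
    proj = layers.Dense(n_features)        # maps the hidden state back to feature space

    def autoregressive_decode(encoded_state, batch_size):
        state = encoded_state                       # [h, c] from the encoder
        prev = tf.zeros((batch_size, n_features))   # "previous prediction" for step 0
        outputs = []
        for _ in range(steps):
            out, state = cell(prev, state)          # run one decoder step
            prev = proj(out)                        # project to features and feed back
            outputs.append(prev)
        return tf.stack(outputs, axis=1)            # (batch, steps, n_features)

With 1500 steps this Python loop is slow as written; it is only meant to show where the per-step feedback would enter, compared to feeding the same repeated vector at every step.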

TL;DR: 1) seq-to-seq is not the best model for this case; 2) due to the bottleneck, it cannot really learn to do anything better than predict the mean value for each feature.


2 Comments

Thank you, I understand: Model 1 is just compressing the input along the features, whereas Model 2 is compressing along both dimensions (time, features), hence a much more compressed internal representation for the decoder to work with. And since it's a rather long sequence, I'm probably losing too much information, as you said.
Would you recommend using a model with "attention" in order to avoid the bottleneck?
