2

I have the following preprocessing for a tensorflow neural-network:

import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tensorflow.keras.layers import Input,Dense,LSTM,Flatten,GlobalAveragePooling1D,Embedding,Dropout

!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv \
    -O /tmp/bbc-text.csv



# Stopwords list from https://github.com/Yoast/YoastSEO.js/blob/develop/src/config/stopwords.js
# Convert it to a Python list and paste it here
stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at",
             "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do",
             "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having",
             "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how",
             "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself",
             "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought",
             "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should",
             "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then",
             "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through",
             "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were",
             "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why",
             "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself",
             "yourselves"]

#----------------------------------- Ream from Csv and remove the stopwords
sentences = []
labels = []
with open("/tmp/bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)
    for row in reader:
        labels.append(row[0])
        sentence = row[1]
        for word in stopwords:
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
            sentence = sentence.replace(" ", " ")
        sentences.append(sentence)


#----------------------------------  Tokenize sentences
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding = 'post')

#--------------------------------- Tokenize labels
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
# label_word_index = label_tokenizer.word_index
label_seq = label_tokenizer.texts_to_sequences(labels)`

and finally here is the neural network which works by the prepared data:

train_sentence = tf.convert_to_tensor(padded,tf.int32)
train_label = tf.convert_to_tensor(label_seq,tf.int32)

input = Input(shape=(2441,))
x = Embedding(input_dim=10000,output_dim=128)(input)
x = LSTM(64,return_sequences=True)(x)
x = LSTM(64,return_sequences=True)(x)
x = LSTM(64,return_sequences=True)(x)
x = Dropout(0.2)(x)
x = LSTM(64)(x)
x = Flatten()(x)
output = Dense(5, activation='softmax')(x)
model = tf.keras.models.Model(input,output)

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(x=train_sentence,y=train_label,epochs=10)

But, it fails with the following error:

InvalidArgumentError: Graph execution error:
2
  • @AloneTogether Can you please help me ? Commented May 4, 2022 at 19:25
  • 1
    Will take a look Commented May 4, 2022 at 20:07

1 Answer 1

1

The input_dim of the Embedding layer has to correspond to the size of your data's vocabulary + 1. Also, your labels should begin from zero and not from one when using the sparse_categorical_crossentropy loss function. Here is a working example based on your code and data:

# ...
# ...
train_sentence = tf.convert_to_tensor(padded,tf.int32)
train_label = tf.convert_to_tensor(label_seq,tf.int32)
train_label = train_label - 1

input = Input(shape=(2441,))
x = Embedding(input_dim=len(tokenizer.word_index) + 1,output_dim=128)(input)
x = LSTM(64,return_sequences=True)(x)
x = LSTM(64,return_sequences=True)(x)
x = LSTM(64,return_sequences=True)(x)
x = Dropout(0.2)(x)
x = LSTM(64)(x)
x = Flatten()(x)
output = Dense(5, activation='softmax')(x)
model = tf.keras.models.Model(input,output)

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(x=train_sentence,y=train_label,epochs=10)
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks alot. But, if you do label_word_index = label_tokenizer.word_index and then check the values of label_word_index then it does not begin from 0, but still it works well. Can you please explain it ?
It is reserved for words not in label_word_index

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.