I'm processing text that I need to break up into a list of sentence tokens, which are themselves broken down into word tokens. For example:
raw_text = "the cat in the hat. green eggs and ham. one fish two fish."
I also have a list of stopwords that I want to remove from the text:
stopwords = ['the', 'and', 'in']
I'm doing the tokenization with a list comprehension, using the nltk module:
from nltk import sent_tokenize, word_tokenize
sentence_tokens = [word_tokenize(sentence) for sentence in sent_tokenize(raw_text)]
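(Note: sent_tokenize and word_tokenize rely on NLTK's 'punkt' tokenizer models, so if you haven't downloaded them yet you may need to run:
import nltk
nltk.download('punkt')
before the tokenization will work.)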
This yields the following:
[['the', 'cat', 'in', 'the', 'hat', '.'], ['green', 'eggs', 'and', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]
I can filter out the stopwords with nested for loops:
for sentence in sentence_tokens:
    for word in list(sentence):  # iterate over a copy; removing from the list being looped over would skip words
        if word in stopwords:
            sentence.remove(word)
What I'm having trouble with is combining all of this into a single list comprehension so it's a bit cleaner. Any advice? Thanks!
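For reference, the shape I'm picturing is a nested comprehension with the filter inline, something like the sketch below (using raw_text and stopwords from the setup above), though I'm not sure this is the idiomatic way to write it:
sentence_tokens = [
    [word for word in word_tokenize(sentence) if word not in stopwords]
    for sentence in sent_tokenize(raw_text)
]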