I'm processing text that I need to break up into a list of sentence tokens, which are themselves broken down into word tokens. For example:

raw_text = "the cat in the hat.  green eggs and ham.  one fish two fish."

I also have a list of stopwords that I want to remove from the text:

stopwords = ['the', 'and', 'in']

I'm doing the list comprehension using the nltk module:

from nltk import sent_tokenize, word_tokenize

sentence_tokens = [word_tokenize(sentence) for sentence in sent_tokenize(raw_text)]

This yields the following:

[['the', 'cat', 'in', 'the', 'hat', '.'], ['green', 'eggs', 'and', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]

I can filter out the stopwords with nested for loops:

for sentences in sentence_tokens:
    for word in sentences:
        if word in stopwords:
            sentences.remove(word)

What I'm having trouble doing is combining these all into a single list comprehension so it's a bit cleaner. Any advice? Thanks!

2 Answers

Make stopwords a set; you can then use a list comprehension to filter out the words from each sublist that appear in the set of stopwords:

stopwords = {'the', 'and', 'in'}


l = [['the', 'cat', 'in', 'the', 'hat', '.'], ['green', 'eggs', 'and', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]


l[:] = [[word for word in sub if word not in stopwords] for sub in l]

Output:

[['cat', 'hat', '.'], ['green', 'eggs', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]

Using l[:] means we mutate the original list object in place. Broken up into a for loop, it looks like:

# for each sublist in l
for sub in l:
    # for each word in the sublist, keep it only if it is not in stopwords 
    sub[:] = [word for word in sub if word not in stopwords]
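To see why the in-place assignment matters, here is a small illustration (using a shortened toy list): another name bound to the same list sees the filtered result, which would not happen if we simply rebound l to a new list.

```python
stopwords = {'the', 'and', 'in'}
l = [['the', 'cat'], ['green', 'and', 'ham']]
alias = l  # a second reference to the same list object

# Slice assignment replaces the contents of the existing object.
l[:] = [[w for w in sub if w not in stopwords] for sub in l]

print(alias)       # [['cat'], ['green', 'ham']] -- the alias sees the change
print(alias is l)  # True -- still the same object
```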

Your own code also has a bug: you should never remove elements from a list while iterating over it. You would need to iterate over a copy, or iterate from the right with reversed:

for sentences in l:
    for word in reversed(sentences):
        if word in stopwords:
            sentences.remove(word)

When you remove an element while iterating from the left, every element after it shifts one position down, but the iterator still advances, so it skips the element that moved into the removed slot. Consecutive stopwords therefore survive the loop.
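A tiny demonstration of that skipping behaviour, using two consecutive stopwords:

```python
stopwords = {'the', 'and', 'in'}
words = ['the', 'the', 'cat']

# Removing while iterating forward: after the first 'the' is removed,
# the list shifts left and the iterator jumps past the second 'the'.
for word in words:
    if word in stopwords:
        words.remove(word)

print(words)  # ['the', 'cat'] -- the second 'the' was skipped, not removed
```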

1 Comment

Awesome! Thanks a lot for the explanation, and also for pointing out the error in my previous code. It's very much appreciated! I'm pretty new to python and I probably wouldn't have figured that out!
Tip: NLTK is not required for this task; plain Python will do. Here is a cleaner way to remove stopwords from text (I'm using Python 2.7 here).

When you want a string instead of list of words:

raw_text = "the cat in the hat.  green eggs and ham.  one fish two fish."
stopwords = ['the', 'and', 'in']
clean_text = " ".join(word for word in raw_text.split() if word not in stopwords)

When you want a list of words:

raw_text = "the cat in the hat.  green eggs and ham.  one fish two fish."
stopwords = ['the', 'and', 'in']
clean_list = [word for word in raw_text.split() if word not in stopwords]
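The same plain-Python approach extends to the nested case in the question. As a rough sketch, splitting on '.' stands in for sent_tokenize (a simplification: real sentence tokenization handles abbreviations, other punctuation, etc., and unlike word_tokenize this drops the '.' tokens):

```python
raw_text = "the cat in the hat.  green eggs and ham.  one fish two fish."
stopwords = {'the', 'and', 'in'}

# Split into rough "sentences" on periods, skip empty trailing pieces,
# then split each sentence on whitespace and drop stopwords.
nested = [
    [word for word in sentence.split() if word not in stopwords]
    for sentence in raw_text.split('.') if sentence.strip()
]

print(nested)
# [['cat', 'hat'], ['green', 'eggs', 'ham'], ['one', 'fish', 'two', 'fish']]
```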

1 Comment

OP has a nested list.
