I'm processing text that I need to break up into a list of sentence tokens, which are themselves broken down into word tokens. For example:

raw_text = "the cat in the hat.  green eggs and ham.  one fish two fish."

I also have a list of stopwords that I want to remove from the text:

stopwords = ['the', 'and', 'in']

I'm doing the list comprehension using the nltk module:

from nltk import sent_tokenize, word_tokenize

sentence_tokens = [word_tokenize(sentence) for sentence in sent_tokenize(raw_text)]

This yields the following:

[['the', 'cat', 'in', 'the', 'hat', '.'], ['green', 'eggs', 'and', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]

I can filter out the stopwords with nested for loops:

for sentences in sentence_tokens:
    for word in sentences:
        if word in stopwords:
            sentences.remove(word)

What I'm having trouble doing is combining these all into a single list comprehension so it's a bit cleaner. Any advice? Thanks!

2 Answers

Make stopwords a set; you can then use a list comprehension to filter out the words from each sublist that appear in the set of stopwords:

stopwords = {'the', 'and', 'in'}


l = [['the', 'cat', 'in', 'the', 'hat', '.'], ['green', 'eggs', 'and', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]


l[:] = [[word for word in sub if word not in stopwords] for sub in l]

Output:

[['cat', 'hat', '.'], ['green', 'eggs', 'ham', '.'], ['one', 'fish', 'two', 'fish', '.']]

Using l[:] means we mutate the original list object in place. Broken up into a for loop, it looks like:

# for each sublist in l
for sub in l:
    # for each word in the sublist, keep it only if it is not in stopwords 
    sub[:] = [word for word in sub if word not in stopwords]
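To see why the in-place assignment matters, here is a small illustration (using a shortened toy list): another name bound to the same list sees the filtered result, which would not happen if we simply rebound l to a new list.

```python
stopwords = {'the', 'and', 'in'}
l = [['the', 'cat'], ['green', 'and', 'ham']]
alias = l  # a second reference to the same list object

# Slice assignment replaces the contents of the existing object.
l[:] = [[w for w in sub if w not in stopwords] for sub in l]

print(alias)       # [['cat'], ['green', 'ham']] -- the alias sees the change
print(alias is l)  # True -- still the same object
```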

Your own code also has a bug: you should never remove elements from a list while iterating over it. You would need to iterate over a copy, or iterate from the right with reversed:

for sentences in l:
    for word in reversed(sentences):
        if word in stopwords:
            sentences.remove(word)

When you remove an element while iterating from the left, every element after it shifts one position down, but the iterator still advances, so it skips the element that moved into the removed slot. Consecutive stopwords therefore survive the loop.
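A tiny demonstration of that skipping behaviour, using two consecutive stopwords:

```python
stopwords = {'the', 'and', 'in'}
words = ['the', 'the', 'cat']

# Removing while iterating forward: after the first 'the' is removed,
# the list shifts left and the iterator jumps past the second 'the'.
for word in words:
    if word in stopwords:
        words.remove(word)

print(words)  # ['the', 'cat'] -- the second 'the' was skipped, not removed
```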

1 Comment

Awesome! Thanks a lot for the explanation, and also for pointing out the error in my previous code. It's very much appreciated! I'm pretty new to python and I probably wouldn't have figured that out!
Tip: NLTK is not required for this task; plain Python will do. Here is a cleaner way to remove stopwords from text (I'm using Python 2.7 here).

When you want a string instead of list of words:

raw_text = "the cat in the hat.  green eggs and ham.  one fish two fish."
stopwords = ['the', 'and', 'in']
clean_text = " ".join(word for word in raw_text.split() if word not in stopwords)

When you want a list of words:

raw_text = "the cat in the hat.  green eggs and ham.  one fish two fish."
stopwords = ['the', 'and', 'in']
clean_list = [word for word in raw_text.split() if word not in stopwords]
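The same plain-Python approach extends to the nested case in the question. As a rough sketch, splitting on '.' stands in for sent_tokenize (a simplification: real sentence tokenization handles abbreviations, other punctuation, etc., and unlike word_tokenize this drops the '.' tokens):

```python
raw_text = "the cat in the hat.  green eggs and ham.  one fish two fish."
stopwords = {'the', 'and', 'in'}

# Split into rough "sentences" on periods, skip empty trailing pieces,
# then split each sentence on whitespace and drop stopwords.
nested = [
    [word for word in sentence.split() if word not in stopwords]
    for sentence in raw_text.split('.') if sentence.strip()
]

print(nested)
# [['cat', 'hat'], ['green', 'eggs', 'ham'], ['one', 'fish', 'two', 'fish']]
```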

1 Comment

OP has a nested list.
