Using nltk library to extract keywords

Question

I am working on an application that requires me to extract keywords (and finally generate a tag cloud of these words) from a stream of conversations. I am considering the following steps:

1. Tokenize each raw conversation (output stored as a list of lists of strings)
2. Remove stop words
3. Apply a stemmer (Porter stemming algorithm)

Up to this point, nltk provides all the tools I need. After this, however, I need to somehow "rank" these words and come up with the most important ones. Can anyone suggest which tools from nltk might be used for this?

Thanks, Nihit

Parsimonious language models are a promising approach to ranking terms, specifically for generating word clouds. See my implementation at github.com/larsmans/weighwords (WIP).
– Fred Foo, Commented Jun 8, 2011 at 12:09

masterpiga · Accepted Answer · 2011-06-08 12:50:31Z

3

I guess it depends on your definition of "important". If you are talking about frequency, then you can just build a dictionary with words (or stems) as keys and their counts as values. Afterwards, you can sort the keys in the dictionary by count.

Something like (not tested):

from collections import defaultdict

# Collect word statistics.
# stemmed_sentences is the list of lists of stems from the earlier steps.
counts = defaultdict(int)
for sent in stemmed_sentences:
    for stem in sent:
        counts[stem] += 1

# Keep only stems that occur at least 3 times.
# Rarer stems are not relevant, and sorting fewer pairs is faster.
pairs = [(stem, count) for stem, count in counts.items() if count >= 3]

# Sort (stem, count) pairs by count, most frequent first.
sorted_stems = sorted(pairs, key=lambda pair: pair[1], reverse=True)
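
As a possibly tidier variant of the counting code above, `collections.Counter` from the standard library can do the tallying and the ranking together. This is only a sketch: the sample `stemmed_sentences` is made up to stand in for the output of the tokenize, stop-word, and stemming steps, and the helper name `top_stems` with its `min_count` and `n` parameters is an illustrative assumption, not something nltk provides.

```python
from collections import Counter
from itertools import chain


def top_stems(stemmed_sentences, min_count=3, n=None):
    """Return (stem, count) pairs with count >= min_count, most frequent first."""
    # Flatten the list of lists and count every stem in one pass.
    counts = Counter(chain.from_iterable(stemmed_sentences))
    # most_common() yields all pairs sorted by descending count.
    ranked = [(stem, c) for stem, c in counts.most_common() if c >= min_count]
    return ranked if n is None else ranked[:n]


# Hypothetical sample input: one list of stems per conversation.
stemmed_sentences = [
    ["keyword", "extract", "tag", "cloud"],
    ["keyword", "rank", "tag"],
    ["keyword", "tag", "stem"],
    ["keyword", "rank"],
]

print(top_stems(stemmed_sentences, min_count=2))
# [('keyword', 4), ('tag', 3), ('rank', 2)]
```

The resulting (stem, count) pairs could be handed directly to whatever draws the tag cloud, with the count used as the font-size weight.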

edited Jun 8, 2011 at 12:50

answered Jun 8, 2011 at 11:16

masterpiga


2 Comments

Fred Foo Over a year ago

... and you can try to penalize all-too-common words with idf, although user studies have shown tf clouds to be preferred to tf-idf ones. +1.
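
The idf penalty mentioned in this comment could be sketched roughly as follows, treating each conversation as one document. Everything here is an assumption made for illustration: the helper name `tf_idf_scores`, the sample data, and the particular weighting (total term frequency times log(N / document frequency)); other tf-idf variants exist.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """Score each stem by total term frequency times log(N / document frequency)."""
    n_docs = len(documents)
    tf = Counter()  # total occurrences of each stem across all documents
    df = Counter()  # number of documents in which each stem appears
    for doc in documents:
        tf.update(doc)
        df.update(set(doc))
    return {stem: tf[stem] * math.log(n_docs / df[stem]) for stem in tf}


# Hypothetical sample: each inner list holds the stems of one conversation.
documents = [
    ["hello", "price", "price", "discount"],
    ["hello", "ship", "delay"],
    ["hello", "price", "refund"],
]

scores = tf_idf_scores(documents)
for stem, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stem}\t{score:.3f}")
```

In this sample, "hello" appears in every conversation and therefore scores zero, which is the penalty for all-too-common words. As the comment notes, a plain tf cloud may still be the one readers prefer.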

Jacob Over a year ago

Also look into information-gain metrics and significance testing. nltk.metrics provides some good functions in that area.
