I am working on an application that requires me to extract keywords (and finally generate a tag cloud of these words) from a stream of conversations. I am considering the following steps:

  1. Tokenize each raw conversation (output stored as List of List of strings)
  2. Remove stop words
  3. Use stemmer (Porter stemming algorithm)

Up to this point, nltk provides all the tools I need. After this, however, I need to somehow "rank" these words and come up with the most important ones. Can anyone suggest which tools from nltk might be used for this?

Thanks, Nihit

  • A promising approach to ranking terms, specifically for generating word clouds, is to use parsimonious language models. See my implementation at github.com/larsmans/weighwords (WIP). Commented Jun 8, 2011 at 12:09

1 Answer

I guess it depends on your definition of "important". If you are talking about frequency, then you can just build a dictionary using words (or stems) as keys, and then counts as values. Afterwards, you can sort the keys in the dictionary based on their count.

Something like (not tested):

from collections import defaultdict

# stemmed_sentences is assumed to be the list of lists of stems
# produced by the tokenize / stop-word removal / stemming steps

# Collect word statistics
counts = defaultdict(int)
for sent in stemmed_sentences:
    for stem in sent:
        counts[stem] += 1

# Drop all stems with count < 3: they are not relevant,
# and sorting the smaller list will be faster
pairs = [(stem, count) for stem, count in counts.items() if count >= 3]

# Sort (stem, count) pairs by count, most frequent first
sorted_stems = sorted(pairs, key=lambda pair: pair[1], reverse=True)
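
The same frequency ranking can also be written with collections.Counter from the standard library. This is only a minimal sketch: the sample stemmed_sentences below is made-up data standing in for the output of the preprocessing steps, and the cutoff of 2 is arbitrary.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for the stemmed, stop-word-free conversations
stemmed_sentences = [
    ["tag", "cloud", "keyword"],
    ["keyword", "rank", "cloud"],
    ["keyword", "stem"],
]

# Flatten the list of lists and count every stem in one pass
counts = Counter(chain.from_iterable(stemmed_sentences))

# most_common(n) returns the n highest-count (stem, count) pairs, highest first
top_stems = counts.most_common(2)

# Optional: keep only stems seen at least min_count times (arbitrary threshold)
min_count = 2
frequent = [(stem, count) for stem, count in counts.most_common() if count >= min_count]

print(top_stems)
```

Counter.most_common already sorts by descending count, so no separate sort step is needed.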

2 Comments

... and you can try to penalize all-too-common words with idf, although user studies have shown tf clouds to be preferred to tf-idf ones. +1.
Also look into information-gain metrics and significance testing. nltk.metrics provides some good functions in that area.
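
To illustrate the idf suggestion in the first comment, here is a rough sketch of one common tf-idf variant using only the standard library. It assumes each conversation is treated as one document; the function name tfidf_scores and the sample conversations are invented for the example, and other idf formulas (for instance smoothed ones) are equally valid.

```python
import math
from collections import Counter

def tfidf_scores(documents):
    """Score each term as (total count) * log(N / document frequency)."""
    n_docs = len(documents)
    tf = Counter()  # how often each term occurs overall
    df = Counter()  # in how many documents each term occurs
    for doc in documents:
        tf.update(doc)
        df.update(set(doc))
    return {term: tf[term] * math.log(n_docs / df[term]) for term in tf}

# Hypothetical stemmed conversations, one list of stems per conversation
conversations = [
    ["hello", "price", "order"],
    ["hello", "order", "ship"],
    ["hello", "refund", "refund"],
]

scores = tfidf_scores(conversations)

# Terms sorted from highest to lowest score
ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

With this formula a term that appears in every conversation ("hello" here) scores zero, which is the penalty for all-too-common words that the comment describes.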
