
I am new to both Python and scikit-learn. I want to cluster a set of text files (the bodies of news articles), and I am using the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import nltk, sklearn, string, os
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Preprocessing text with NLTK package
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
###########################################################################
# Loading and preprocessing text data
print("\n Loading text dataset:")
path = 'n'

for subdir, dirs, files in os.walk(path):
    for f in files:
        if f != '.DS_Store':
            file_path = os.path.join(subdir, f)
            with open(file_path, 'r') as shakes:
                text = shakes.read()
            lowers = text.lower()
            # build a translation table that deletes punctuation characters
            no_punctuation = lowers.translate(str.maketrans('', '', string.punctuation))
            token_dict[f] = no_punctuation
###########################################################################
true_k = 3  # number of clusters
print("\n Performing stemming and tokenization...")
vectorizer = TfidfVectorizer(tokenizer=tokenize, encoding='latin-1',
                              stop_words='english')
X = vectorizer.fit_transform(token_dict.values())
print("n_samples: %d, n_features: %d" % X.shape)
print()
###############################################################################
# Do the actual clustering
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(X)  # fit returns the estimator itself, so there is no need to store it
print(km)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

This code prints the top terms per cluster. But how can I find out which of the original text files belong to cluster 0, cluster 1, or cluster 2?

  • The cluster membership is stored in km.labels_ and you can also get it using km.predict(X). Commented Nov 26, 2014 at 23:33
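For example, continuing from the km.fit(X) call in the question, both give the same per-document cluster indices:

labels = km.labels_           # cluster index assigned to each document during fit
same_labels = km.predict(X)   # predicting on the training data returns the same assignments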

1 Answer


To explain a bit more: you can store the cluster allocations using the following:

clusters = km.labels_.tolist()

This list is in the same order as the values of the dict you passed to the vectorizer.
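For instance, to see which file landed in which cluster, you can zip the file names with the labels. This reuses the token_dict and km objects from the question, and relies on token_dict.keys() and token_dict.values() iterating in the same order, which Python guarantees as long as the dict is not modified in between:

clusters = km.labels_.tolist()

# pair each file name with the cluster index of its document
for filename, cluster in zip(token_dict.keys(), clusters):
    print("%s -> cluster %d" % (filename, cluster))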

I just put together a guide to document clustering you might find helpful. Let me know if I can explain anything in more detail: http://brandonrose.org/clustering


3 Comments

Nice post! I see that you have used K-means for clustering and cosine distance for calculating similarity. Is that not a problem? Would it not be better if the vectorizer results were normalized, which would make KMeans behave as spherical k-means? (See the sketch after these comments.)
I'm downvoting due to python 2.7 code in response to 3.x post
@Schalton fair enough. Feel free to fork the repo and update it to 3.whatever if you'd like to help github.com/brandomr/document_cluster
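For what it's worth, here is a minimal sketch of the normalization idea from the first comment. It L2-normalizes the tf-idf rows with sklearn.preprocessing.Normalizer before running KMeans, so that Euclidean distance on the unit-length rows orders points the same way cosine similarity does. Note that TfidfVectorizer already applies norm='l2' by default, so this mainly matters if you have changed that setting; km_spherical is just an illustrative name:

from sklearn.preprocessing import Normalizer

# L2-normalize each row of the tf-idf matrix, then cluster the unit vectors
X_norm = Normalizer(norm='l2').fit_transform(X)
km_spherical = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km_spherical.fit(X_norm)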
