
I am new to both Python and scikit-learn. I want to cluster a set of text files (the bodies of news articles), and I am using the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import nltk, sklearn, string, os
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Preprocessing text with NLTK package
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
###########################################################################
# Loading and preprocessing text data
print("\n Loading text dataset:")
path = 'n'

for subdir, dirs, files in os.walk(path):
    for f in files:
        if f != '.DS_Store':
            file_path = os.path.join(subdir, f)
            with open(file_path, 'r') as shakes:
                text = shakes.read()
            lowers = text.lower()
            # build a translation table that deletes punctuation characters
            no_punctuation = lowers.translate(str.maketrans('', '', string.punctuation))
            token_dict[f] = no_punctuation
###########################################################################
true_k = 3  # number of clusters
print("\n Performing stemming and tokenization...")
vectorizer = TfidfVectorizer(tokenizer=tokenize, encoding='latin-1',
                              stop_words='english')
X = vectorizer.fit_transform(token_dict.values())
print("n_samples: %d, n_features: %d" % X.shape)
print()
###############################################################################
# Do the actual clustering
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(X)  # fit returns the estimator itself, so there is no need to store it
print(km)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

This code prints the top terms per cluster. But how can I find out which of the original text files belong to cluster 0, cluster 1, or cluster 2?

  • The cluster membership is stored in km.labels_ and you can also get it using km.predict(X). Commented Nov 26, 2014 at 23:33
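For example, continuing from the km.fit(X) call in the question, both give the same per-document cluster indices:

labels = km.labels_           # cluster index assigned to each document during fit
same_labels = km.predict(X)   # predicting on the training data returns the same assignments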

1 Answer


To explain a bit more: you can store the cluster allocations using the following:

clusters = km.labels_.tolist()

This list is in the same order as the values of the dict you passed to the vectorizer.
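For instance, to see which file landed in which cluster, you can zip the file names with the labels. This reuses the token_dict and km objects from the question, and relies on token_dict.keys() and token_dict.values() iterating in the same order, which Python guarantees as long as the dict is not modified in between:

clusters = km.labels_.tolist()

# pair each file name with the cluster index of its document
for filename, cluster in zip(token_dict.keys(), clusters):
    print("%s -> cluster %d" % (filename, cluster))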

I just put together a guide to document clustering you might find helpful. Let me know if I can explain anything in more detail: http://brandonrose.org/clustering


3 Comments

Nice post! I see that you have used K-means for clustering and cosine distance for calculating similarity. Is that not a problem? Would it not be better if the vectorizer results were normalized, which would make KMeans behave as spherical k-means? (See the sketch after these comments.)
I'm downvoting due to python 2.7 code in response to 3.x post
@Schalton fair enough. Feel free to fork the repo and update it to 3.whatever if you'd like to help github.com/brandomr/document_cluster
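For what it's worth, here is a minimal sketch of the normalization idea from the first comment. It L2-normalizes the tf-idf rows with sklearn.preprocessing.Normalizer before running KMeans, so that Euclidean distance on the unit-length rows orders points the same way cosine similarity does. Note that TfidfVectorizer already applies norm='l2' by default, so this mainly matters if you have changed that setting; km_spherical is just an illustrative name:

from sklearn.preprocessing import Normalizer

# L2-normalize each row of the tf-idf matrix, then cluster the unit vectors
X_norm = Normalizer(norm='l2').fit_transform(X)
km_spherical = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km_spherical.fit(X_norm)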
