
Hello everyone, I am performing k-means clustering on data in a text file which has about 50k samples, each with 128 dimensions.

Example of my input:

[1,1,0,0,0,0,1,0,24,3,0,0,0,0,86,149,149,14,0,0,0,0,32,149,46,16,0,0,1,13,3,33,65,66,0,0,0,0,0,2,149,140,6,0,0,2,62,148,88,24,26,2,0,14,116,148,30,15,1,0,0,1,5,30,56,18,0,0,0,0,0,4,149,46,40,14,0,0,1,34,31,46,149,31,0,2,9,12,1,7,8,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,12,2,0,0,0,0,0,0,0,0,0,0,0,0]

(likewise 50k samples)

When I use only about 20-30 lines of this input with this code,

from sklearn.cluster import MiniBatchKMeans
import numpy 
import csv

f = open("sample_input.txt", "r") 
out = [eval(arr) for arr in f.readlines()]


mbkm = MiniBatchKMeans(init='k-means++', n_clusters=50, batch_size=50,
                       n_init=10, max_no_improvement=10, verbose=0)
mbkm.fit(out)
mbk_means_cluster_centers = mbkm.cluster_centers_

numpy.set_printoptions(threshold=numpy.nan)
print mbk_means_cluster_centers

I get the expected output. But when I use the entire file (with either a .txt or .csv extension), I get the error "setting an array element with a sequence".

If my code works for 20-30 lines, why does it not work for 50k lines of input? I assumed that converting the text file to CSV just means renaming it with a .csv extension.

My main question is how to get this code running for 50k lines of input. Only once this is resolved can I run it on another dataset, which has about 300,000 lines of input. Please help. Thanks in advance!

PS: I am coding in python 2.7 in ubuntu platform.

  • You could try dividing the file in half and seeing if the error occurs on only one half to see if the problem is in the data. This might also help to discover any file size limitation. Commented Jan 21, 2015 at 16:39
  • I guess the problem is in the data, but I am not able to find out where. I tried running the code with fragments of the input. For certain parts the code runs and for certain parts it does not. The data looks fine to me, but I don't know what is going wrong or where. @JamieBull Commented Jan 21, 2015 at 16:55

1 Answer

It looks like you have two or more lists on a line somewhere meaning you're trying to evaluate two or more arrays (a sequence) as a single array. When I test this with two arrays separated by a comma then I get the same error as you.

Try this to find the error:

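with open("sample_input.txt", "r") as f:
    # Line numbers start at 1 to match what a text editor shows
    for n, line in enumerate(f, start=1):
        # Compare with !=, not "is not", which tests identity rather than value
        if len(eval(line)) != 128:
            print("Error is on line %s" % n)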
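
A slightly more defensive variant of the same check, as a sketch only: it uses `ast.literal_eval` instead of `eval`, so a stray line cannot execute arbitrary code, and it also reports lines that fail to parse or that are not a single flat list. The function name `find_bad_lines` and its `expected_len` parameter are my own illustrative choices, not something from the question.

```python
import ast

def find_bad_lines(lines, expected_len=128):
    """Return (line_number, reason) pairs for lines that are not a flat
    list of `expected_len` numbers. Name and signature are hypothetical."""
    problems = []
    for n, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            problems.append((n, "blank line"))
            continue
        try:
            value = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            problems.append((n, "does not parse"))
            continue
        if not isinstance(value, list):
            # e.g. "[...],[...]" parses as a tuple of two lists
            problems.append((n, "not a single list"))
        elif len(value) != expected_len:
            problems.append((n, "length %d" % len(value)))
        elif not all(isinstance(x, (int, float)) for x in value):
            # catches nested lists or strings inside an otherwise valid row
            problems.append((n, "non-numeric element"))
    return problems
```

Passing an open file object (for example the handle for sample_input.txt) as `lines` should work, since iterating over a file yields its lines; each returned pair gives the 1-based line number and a short reason.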

Otherwise, I suggest "divide and conquer". If you split the data in half and there's a problem in one half, split that half again and keep going until you are left with only a small chunk of the file that contains the problem. The problem may be in more than one place, which means it could take a while, but it still seems like the best approach if the cause is not what I suggested above.
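
The halving can also be automated. Below is a minimal sketch, assuming you can write a check `fails(chunk)` that returns True whenever a list of lines contains at least one bad line; `bisect_bad_lines`, `fails` and `offset` are hypothetical names introduced for illustration. One caveat on that assumption: a row of the wrong length only triggers the array error when it is mixed with rows of a different length, so a line on its own may pass `mbkm.fit`. The check should therefore judge each chunk against the expected format rather than rely on the fit call raising.

```python
def bisect_bad_lines(lines, fails, offset=0):
    """Return the 0-based indices of the individual lines that make
    `fails` return True, by repeatedly halving the failing chunks.

    `fails` is a caller-supplied function (hypothetical here) that takes a
    list of lines and returns True if that chunk contains a problem.
    """
    # A chunk that passes needs no further splitting
    if not lines or not fails(lines):
        return []
    # A failing chunk of one line is a culprit
    if len(lines) == 1:
        return [offset]
    mid = len(lines) // 2
    # Recurse into both halves, since there may be more than one bad line
    return (bisect_bad_lines(lines[:mid], fails, offset)
            + bisect_bad_lines(lines[mid:], fails, offset + mid))
```

Because both halves are searched, this finds every offending line rather than stopping at the first one, and only the failing halves are split further. Add 1 to each returned index to get the line number shown in a text editor.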
