How to resolve setting an array element with a sequence error while performing K-Means clustering?

Question

Hello everyone, I am performing k-means clustering on data in a text file which has about 50k samples, each with 128 dimensions.

Example of my input:

[1,1,0,0,0,0,1,0,24,3,0,0,0,0,86,149,149,14,0,0,0,0,32,149,46,16,0,0,1,13,3,33,65,66,0,0,0,0,0,2,149,140,6,0,0,2,62,148,88,24,26,2,0,14,116,148,30,15,1,0,0,1,5,30,56,18,0,0,0,0,0,4,149,46,40,14,0,0,1,34,31,46,149,31,0,2,9,12,1,7,8,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,12,2,0,0,0,0,0,0,0,0,0,0,0,0]

(likewise 50k samples)

When I use say about 20-30 lines of this input in this code,

from sklearn.cluster import MiniBatchKMeans
import numpy 
import csv

f = open("sample_input.txt", "r") 
out = [eval(arr) for arr in f.readlines()]


mbkm = MiniBatchKMeans(init='k-means++', n_clusters=50, batch_size=50,
                       n_init=10, max_no_improvement=10, verbose=0)
mbkm.fit(out)
mbk_means_cluster_centers = mbkm.cluster_centers_

numpy.set_printoptions(threshold=numpy.nan)
print mbk_means_cluster_centers

I get the output. But when I use the entire file (whether it has a .txt or .csv extension), I get the error "setting an array element with a sequence".

If my code works for 20-30 lines, why does it not work for 50k lines of input? I assume converting the text file to csv just means renaming the file with a .csv extension.

My main question is how to get this code running for 50k lines of input. Only once this is resolved can I run it on another dataset, which has about 300,000 lines of input. Please help. Thanks in advance!

PS: I am using Python 2.7 on Ubuntu.

You could try dividing the file in half and seeing if the error occurs on only one half to see if the problem is in the data. This might also help to discover any file size limitation.
– Jamie Bull, Commented Jan 21, 2015 at 16:39
I guess the problem is in the data, but I am not able to find out where. I tried running the code with fragments of the input. For certain parts the code runs and for certain parts it does not. The data looks fine to me, but I don't know what is going wrong or where. @JamieBull
– Sanathana, Commented Jan 21, 2015 at 16:55

Jamie Bull · Accepted Answer · 2015-01-21 17:11:03Z

It looks like a line somewhere in the file contains two or more lists, which means you are trying to evaluate two or more arrays (a sequence) as a single array. When I test this with two arrays separated by a comma on one line, I get the same error as you.

Try this to find the error:

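with open("sample_input.txt", "r") as f:
    # Line numbers start at 1 so they match what a text editor shows
    for n, line in enumerate(f, start=1):
        # Compare with != (value equality), not "is not" (object identity)
        if len(eval(line)) != 128:
            print("Error is on line %s" % n)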
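If the check above raises an error instead of reporting a line number, a more defensive variant may help. The following is only a sketch: it swaps eval for ast.literal_eval from the standard library (which parses literals without executing code), and the function name find_bad_rows and the expected_dim parameter are made up for illustration. It assumes every valid line is a single flat list of numbers.

```python
import ast

def find_bad_rows(lines, expected_dim=128):
    """Return (line_number, reason) pairs for rows that would break the fit.

    Hypothetical helper: assumes each valid line is one flat list of numbers.
    """
    problems = []
    for n, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            problems.append((n, "blank line"))
            continue
        try:
            row = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            problems.append((n, "cannot be parsed"))
            continue
        if not isinstance(row, list):
            # e.g. "[...],[...]" parses as a tuple holding two lists
            problems.append((n, "not a single list"))
        elif len(row) != expected_dim:
            problems.append((n, "has %d values" % len(row)))
        elif not all(isinstance(v, (int, float)) for v in row):
            problems.append((n, "contains a non-numeric value"))
    return problems
```

Passing an open file object, for example find_bad_rows(open("sample_input.txt")), should give the line numbers to inspect. An empty list means every row parsed as a flat list of the expected length.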

Otherwise, as I suggested in the comments, try "divide and conquer". If you split the data in half and there's a problem in one half, split that half again and keep going until you have only a small chunk of the file with the problem. The problem may be in more than one place, which means it could take a while, but it still seems like the best way to approach the problem if it's not what I suggested.
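The halving procedure can also be automated. This is a rough sketch, not a tested recipe for your data: bisect_bad_lines and chunk_is_ok are hypothetical names, and chunk_is_ok stands for whatever test you choose (for instance, a function that tries to fit the chunk and returns False when a ValueError is raised).

```python
def bisect_bad_lines(lines, chunk_is_ok, min_size=1):
    """Repeatedly halve failing chunks; return (start, end) ranges that still fail.

    Ranges are 0-based and half-open, like Python slices.
    chunk_is_ok is a caller-supplied (hypothetical) predicate on a list of lines.
    """
    bad = []
    stack = [(0, len(lines))]
    while stack:
        start, end = stack.pop()
        if start >= end or chunk_is_ok(lines[start:end]):
            continue  # empty or clean chunk: nothing to report
        if end - start <= min_size:
            bad.append((start, end))  # small enough to inspect by hand
            continue
        mid = (start + end) // 2
        # Push the right half first so the left half is examined first
        stack.append((mid, end))
        stack.append((start, mid))
    return bad
```

One caveat: if a chunk fails only because its two halves disagree with each other (say, every row in one half has 128 values and every row in the other has 64), both halves pass on their own and nothing is reported, so a per-line length check is still worth running.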
