Hello everyone, I am performing k-means clustering on data in a text file which has about 50k samples, each with 128 dimensions.
Example of my input:
[1,1,0,0,0,0,1,0,24,3,0,0,0,0,86,149,149,14,0,0,0,0,32,149,46,16,0,0,1,13,3,33,65,66,0,0,0,0,0,2,149,140,6,0,0,2,62,148,88,24,26,2,0,14,116,148,30,15,1,0,0,1,5,30,56,18,0,0,0,0,0,4,149,46,40,14,0,0,1,34,31,46,149,31,0,2,9,12,1,7,8,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,12,2,0,0,0,0,0,0,0,0,0,0,0,0]
(and so on, for 50k samples)
When I use about 20-30 lines of this input with the following code,
from sklearn.cluster import MiniBatchKMeans
import numpy
import csv
f = open("sample_input.txt", "r")
out = [eval(arr) for arr in f.readlines()]
mbkm = MiniBatchKMeans(init='k-means++', n_clusters=50, batch_size=50,
                       n_init=10, max_no_improvement=10, verbose=0)
mbkm.fit(out)
mbk_means_cluster_centers = mbkm.cluster_centers_
numpy.set_printoptions(threshold=numpy.nan)
print mbk_means_cluster_centers
I get the output. But when I use the entire file (whether with a .txt or a .csv extension), I get the error "setting an array element with a sequence".
Why does my code work for 20-30 lines but not for all 50k lines of input? I assumed that converting the text file to CSV only requires renaming it with a .csv extension.
My main question is how to get this code running for 50k lines of input. Only once this is resolved can I run it on another dataset, which has about 300,000 lines of input. Please help. Thanks in advance!
PS: I am coding in Python 2.7 on Ubuntu.
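A possible way to narrow down the error described above, offered as a sketch rather than a confirmed diagnosis: NumPy typically reports "setting an array element with a sequence" when it is asked to build a rectangular array from rows of unequal length. Assuming that is what happens here (the post does not confirm it), the check below parses each line and reports any line that is blank, does not parse, or does not have the expected 128 values. The names `find_bad_rows` and `expected_dim` are hypothetical, and `ast.literal_eval` is swapped in for `eval` because it only accepts Python literals.

```python
import ast

EXPECTED_DIM = 128  # dimension stated in the question


def find_bad_rows(lines, expected_dim=EXPECTED_DIM):
    """Hypothetical helper: return (rows, problems).

    rows     -- parsed rows that are lists/tuples of exactly expected_dim items
    problems -- (line_number, reason) pairs for every line that was skipped
    """
    rows, problems = [], []
    for lineno, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            problems.append((lineno, "blank line"))
            continue
        try:
            # literal_eval parses Python literals only; it does not run code
            row = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            problems.append((lineno, "could not parse"))
            continue
        if not isinstance(row, (list, tuple)) or len(row) != expected_dim:
            problems.append((lineno, "unexpected shape"))
            continue
        rows.append(list(row))
    return rows, problems


# Small in-memory demo with 3-dimensional rows instead of 128
demo = ["[1, 2, 3]\n", "[4, 5]\n", "\n", "[6, 7, 8]\n"]
rows, problems = find_bad_rows(demo, expected_dim=3)
print(rows)      # [[1, 2, 3], [6, 7, 8]]
print(problems)  # [(2, 'unexpected shape'), (3, 'blank line')]
```

On the real file, the open file object could be passed in place of `demo`, for example `find_bad_rows(open("sample_input.txt"))`. If `problems` comes back empty, the assumption about unequal row lengths does not hold and the cause lies elsewhere.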