I am currently working with a 2D NumPy object array filled with collections.Counter objects. Each Counter is basically a histogram.
- Keys are always from a limited set of integers, e.g. between 0 and 1500.
- The number of items in each Counter is variable; most are small, but some have every key.
This all works fine for my needs with smaller datasets, but with a dataset of around 500 million cells the memory use is around 120 GB, which is a little high.
Interestingly, numpy.save writes it out to a 4 GB file, which makes me think there is something better I could be doing.
Any suggestions on how I can reduce my memory usage?
I considered a 3D array, but because of the number of empty counts it would have to hold, it required even more memory.
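To put a rough number on that, here is a back-of-the-envelope calculation for a dense cells-by-keys array. The count dtypes are assumptions for illustration only; the helper name `dense_size_gb` is made up for this sketch.

```python
import numpy as np

def dense_size_gb(n_cells, n_keys, dtype):
    """Size of a dense (n_cells, n_keys) count array in GB (1 GB = 1e9 bytes)."""
    return n_cells * n_keys * np.dtype(dtype).itemsize / 1e9

n_cells = 500 * 10**6  # roughly the dataset size mentioned above
n_keys = 1501          # keys 0..1500 inclusive

# Assumed count widths: 1, 2 and 4 bytes per count.
for dtype in (np.uint8, np.uint16, np.uint32):
    print("%s: %.1f GB" % (np.dtype(dtype).name, dense_size_gb(n_cells, n_keys, dtype)))
# uint8: 750.5 GB
# uint16: 1501.0 GB
# uint32: 3002.0 GB
```

Even at one byte per count, the dense layout is several times larger than the 120 GB the Counter version uses.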
I make a lot of use of Counter.update in constructing the array, so any method needs a quick and neat way of getting similar functionality.
Access after the data is created isn't a big issue, as long as for each cell I can get the value for each key; there is no need for a dictionary's indexing.
Below is a very simplified example that produces a small dataset roughly analogous to what I've described above. My real data is skewed further towards fewer keys per counter and higher counts per key.
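
    import collections
    import numpy as np

    def counterArray_init(v):
        # One Counter per cell, seeded with a single key
        return collections.Counter([v])

    def build_counter_array():  # wrapper name is illustrative
        # Random keys in 0..1500 inclusive (randint's upper bound is exclusive)
        e = np.random.randint(0, 1501, size=(10, 10))
        row_len, col_len = e.shape
        counterArray = np.zeros([row_len, col_len], dtype=object)
        vinit = np.vectorize(counterArray_init, otypes=[object])
        counterArray[:, :] = vinit(e)
        # Accumulate each cell's histogram into the cell directly below it
        for row in range(1, row_len):
            for col in range(col_len):
                counterArray[row, col].update(counterArray[row - 1, col])
        return counterArray
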
Thanks
Edit: I have realised that in my smaller counters the keys used fall within a small range. The random example code above is not a good example of this behaviour. As a result, I am investigating using an object array filled with variable-length int arrays, plus a separate array that stores the minimum key value for each of those int arrays. It seems like an odd solution, but initial testing suggests it uses only about 20% of the memory of the Counter method.
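A minimal sketch of what that layout could look like, including a stand-in for Counter.update. The function names (`init_cells`, `update_cell`, `get_count`) and the `uint32`/`int16` dtypes are assumptions for illustration, not my actual code.

```python
import numpy as np

COUNT_DTYPE = np.uint32  # assumption: every count fits in 32 bits

def init_cells(e):
    """Each cell starts with one observation of key e[r, c]."""
    counts = np.empty(e.shape, dtype=object)
    for idx in np.ndindex(*e.shape):
        counts[idx] = np.ones(1, dtype=COUNT_DTYPE)
    min_key = e.astype(np.int16)  # smallest key stored in each cell
    return counts, min_key

def update_cell(counts, min_key, dst, src):
    """Counter.update analogue: add the histogram in cell src into cell dst."""
    a, a_lo = counts[dst], int(min_key[dst])
    b, b_lo = counts[src], int(min_key[src])
    lo = min(a_lo, b_lo)
    hi = max(a_lo + len(a), b_lo + len(b))  # one past the largest key
    merged = np.zeros(hi - lo, dtype=COUNT_DTYPE)
    merged[a_lo - lo:a_lo - lo + len(a)] += a
    merged[b_lo - lo:b_lo - lo + len(b)] += b
    counts[dst] = merged
    min_key[dst] = lo

def get_count(counts, min_key, cell, key):
    """Count stored for key in the given cell (0 if outside its key range)."""
    i = key - int(min_key[cell])
    arr = counts[cell]
    return int(arr[i]) if 0 <= i < len(arr) else 0

# Same row-by-row accumulation as in the Counter example, on a tiny input.
e = np.array([[3], [5], [3]])
counts, min_key = init_cells(e)
for row in range(1, e.shape[0]):
    for col in range(e.shape[1]):
        update_cell(counts, min_key, (row, col), (row - 1, col))

get_count(counts, min_key, (2, 0), 3)  # 2
```

Every key between a cell's minimum and maximum takes up a slot even when its count is zero, so this only pays off when each cell's keys are tightly clustered.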