In my program I fill a large numpy array with elements whose number I do not know in advance. Since appending a single element at a time to a numpy array is inefficient, I grow it in chunks of length 10000 initialized with zeros. As a result, I end up with an array that has a tail of zeros, while what I would like is an array whose length is exactly the number of meaningful elements (later I cannot distinguish junk zeros from actual data points whose value is zero). Straightforward copying of a slice, however, doubles the RAM consumption, which is really undesirable since my arrays are quite large. I also looked into the numpy.split functions, but they seem to split arrays only into chunks of equal size, which of course does not suit me.
I illustrate the problem with the following code:
import numpy, os, random

def check_memory(mode_peak = True, mark = ''):
    """Function for measuring the memory consumption (Linux only)"""
    pid = os.getpid()
    with open('/proc/{}/status'.format(pid), 'r') as ifile:
        for line in ifile:
            if line.startswith('VmPeak' if mode_peak else 'VmSize'):
                memory = line[: -1].split(':')[1].strip().split()[0]
                memory = int(memory) / (1024 * 1024)
                break
    mode_str = 'Peak' if mode_peak else 'Current'
    print('{}{} RAM consumption: {:.3f} GB'.format(mark, mode_str, memory))

def generate_element():
    """Test element generator"""
    for i in range(12345678):
        yield numpy.array(random.randrange(0, 1000), dtype = 'i4')

check_memory(mode_peak = False, mark = '#1 ')
a = numpy.zeros(10000, dtype = 'i4')
i = 0
for element in generate_element():
    if i == len(a):
        a = numpy.concatenate((a, numpy.zeros(10000, dtype = 'i4')))
    a[i] = element
    i += 1
check_memory(mode_peak = False, mark = '#2 ')
a = a[: i]
check_memory(mode_peak = False, mark = '#3 ')
check_memory(mode_peak = True, mark = '#4 ')
This outputs:
#1 Current RAM consumption: 0.070 GB
#2 Current RAM consumption: 0.118 GB
#3 Current RAM consumption: 0.118 GB
#4 Peak RAM consumption: 0.164 GB
Can anyone help me find a solution that does not significantly penalize runtime or RAM consumption?
Edit:
I tried to use
a = numpy.delete(a, numpy.s_[i: ])
as well as
a = numpy.split(a, (i, ))[0]
However, both result in the same doubled memory consumption.
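For reference, a rough sketch (reusing a and i from the code above) of what these two calls actually hand back: numpy.delete is documented to return a copy, while numpy.split returns views into the original array, so in neither case is the original full-size buffer simply trimmed in place.

b = numpy.delete(a, numpy.s_[i: ])
print(b.flags.owndata)   # True: numpy.delete allocated a fresh, trimmed copy
c = numpy.split(a, (i, ))[0]
print(c.flags.owndata)   # False: numpy.split returned a view, so the whole original buffer stays referenced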
Comment: np.fromiter, but I assume your generator is just for testing and not what you're actually using. Also, if you yield scalars instead of arrays (as your element), that will be much faster of course, unless each element will actually have some length in your use case.
Reply: The array generation is a step in a big program that contributes only a tiny fraction of the total runtime, so speed is not critical. On the other hand, this step was the memory bottleneck. And the generator is of course much more complex and involves receiving data from the network.
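For completeness, a minimal sketch of the numpy.fromiter route from the comment, assuming the test generator is changed to yield plain ints (the dtype and range are just the ones from the question):

def generate_element():
    """Test element generator yielding plain ints instead of 0-d arrays"""
    for i in range(12345678):
        yield random.randrange(0, 1000)

# numpy.fromiter consumes the iterator directly into a 1-D array of the given
# dtype, so there is no manual chunked growth and no tail of zeros to trim.
a = numpy.fromiter(generate_element(), dtype = 'i4')

Whether its internal buffer growth stays within the memory budget here is something that would have to be measured with the check_memory helper above.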