
Say I have the following numpy array:

a = np.arange(20)

And also an array containing indices as follows:

ix = np.array([4,10,15])

I've been trying to come up with a vectorized solution to the following question: how can I apply a function to each chunk of a obtained by splitting it at the indices in ix?

So say I were to split a with np.split (I'm only using np.split here to illustrate the groups to which I would like to apply a function):

np.split(a,ix)

[array([0, 1, 2, 3]),
 array([4, 5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19])]

And say, for instance, I'd like to take the sum of each chunk, giving:

[6, 39, 60, 85]

How could I vectorize this using numpy?

  • I don't think you can vectorize your ops on that kind of array. Can you use pandas? Commented Feb 5, 2019 at 18:23
  • Well, yes, I could, but I'm looking for a numpy solution for better performance. Commented Feb 5, 2019 at 18:24
  • You are looking for sum only or for a general solution? Commented Feb 5, 2019 at 18:26
  • Take a look at this solution: stackoverflow.com/a/32043366/4001592 Commented Feb 5, 2019 at 18:45
  • It wasn't that one in the case I'm thinking of, but that guy has some serious tricks up his sleeve, so it's a decent bet he has solved something similar. Commented Feb 5, 2019 at 18:47

3 Answers


I do not know if this is the best solution, but you could convert the list of arrays with different sizes into a list of arrays of a fixed size by padding with zeros, and then apply a function like sum that is not affected by the zeros.

See an example below.

a = np.arange(20)
ix = np.array([4,10,15])
b = np.split(a,ix)
print(b)

results in

[array([0, 1, 2, 3]),
 array([4, 5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19])]

Then use itertools to convert the list to an array, as shown here:

import itertools
c = np.array(list(itertools.zip_longest(*b, fillvalue=0))).T
print(c)

which results in

[[ 0  1  2  3  0  0]
 [ 4  5  6  7  8  9]
 [10 11 12 13 14  0]
 [15 16 17 18 19  0]]

then sum it using

np.sum(c, axis = 1)

results in

array([ 6, 39, 60, 85])
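
Putting those steps together, a small self-contained helper (the name grouped_sum is just for illustration, not from the original) might look like this:

import itertools
import numpy as np

def grouped_sum(a, ix):
    # split into chunks, pad the shorter ones with zeros so they stack
    # into a rectangular 2-D array, then sum along each row;
    # the zero padding does not change the sums
    chunks = np.split(a, ix)
    padded = np.array(list(itertools.zip_longest(*chunks, fillvalue=0))).T
    return padded.sum(axis=1)

print(grouped_sum(np.arange(20), np.array([4, 10, 15])))  # [ 6 39 60 85]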

5 Comments

Pretty nice solution even though it's not numpy based, thanks for sharing!
Don't dismiss zip_longest just because it doesn't have a np. prefix! It's one of the nicer ways of padding lists to a uniform length.
Yes, I think this solution is indeed nice! I'm only stressing that fact because I did explicitly mention that I was looking for a numpy solution.
But why do you want a numpy solution?
Because I would be applying the solution to n arrays like a and ix. So ideally I would like to stack them together and perform vectorized operations over them. I start off with a 3-dim array, perform some operations, and end up with these 2-dim arrays a and ix. I made them 1-dim arrays here for simplicity, but it should be the same if the solution is numpy based. If Python loops are used, it won't scale as well.

A pandas solution will be:

import numpy as np
import pandas as pd

a = np.arange(20)

ix = np.array([4, 10, 15])

data = pd.DataFrame(np.split(a, ix))

print(np.nansum(data, axis=1))

Output

[ 6. 39. 60. 85.]
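
Since the NaN padding promotes the result to float64, a small follow-up (my own tweak, not part of the original snippet) gets integer sums back:

# continuing from the snippet above
print(np.nansum(data, axis=1).astype(int))  # [ 6 39 60 85]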

1 Comment

This works because DataFrame pads the short rows with NaN. But creating the frame is a very slow process. My list comprehension is 30x faster.

split produces a list of arrays, which may differ in length. It actually does so iteratively:

In [12]: alist = []
In [13]: alist.append(a[0:ix[0]])
In [14]: alist.append(a[ix[0]:ix[1]])
In [15]: alist.append(a[ix[1]:ix[2]])
In [16]: alist.append(a[ix[2]:])

Applying sum to each element of the list individually makes sense:

In [11]: [np.sum(row) for row in alist]
Out[11]: [6, 39, 60, 85]

When you have a list of arrays that differ in shape, it's a good bet that you'll have to do a Python-level iteration on it.

Fast 'vectorized' code means performing the calculations in compiled code. Most of that is built around multidimensional arrays, e.g. 2-d ones. If your split had produced equal-size arrays, you could use np.sum with the appropriate axis parameter.

In [23]: a1 = a.reshape(4,5)
In [24]: a1
Out[24]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
In [25]: np.sum(a1, axis=1)
Out[25]: array([10, 35, 60, 85])

Sometimes we can play tricks to cast the problem into an n-d one, for example by padding the arrays of the split with zeros. But that casting itself might require iteration.

As raised here (and its links), Origin of AttributeError: object has no attribute 'cos', math (ufunc) functions applied to object-dtype arrays end up delegating the action to the corresponding methods of the objects. But that still involves a (near) Python-level iteration over the objects.
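
As a rough sketch of that object-dtype route (built explicitly here, since newer numpy versions refuse to infer an object array from a ragged list):

import numpy as np

a = np.arange(20)
ix = np.array([4, 10, 15])
alist = np.split(a, ix)

# hold the ragged chunks in an object-dtype array
obj = np.empty(len(alist), dtype=object)
obj[:] = alist

# frompyfunc maps the Python-level sum over each chunk in turn
per_chunk_sum = np.frompyfunc(sum, 1, 1)
print(per_chunk_sum(obj))   # [6 39 60 85], object dtype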


Some timings:

In [57]: timeit [np.sum(row) for row in alist]
31.7 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [58]: timeit np.sum(list(itertools.zip_longest(*alist, fillvalue=0)),axis=0)
25.2 µs ± 82 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [59]: timeit np.nansum(pd.DataFrame(alist), axis=1)
908 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [61]: timeit np.frompyfunc(sum,1,1)(alist)
12.9 µs ± 21.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In this last case the Python sum is faster than np.sum. But that's true with the list comprehension as well:

In [63]: timeit [sum(row) for row in alist]
6.86 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And with Divakar's wiz-bang fillna from Numpy: Fix array with rows of different lengths by filling the empty elements with zeros:

In [70]: timeit numpy_fillna(np.array(alist)).sum(axis=1)
44.2 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
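
numpy_fillna isn't reproduced above; a sketch of such a mask-based fill, adapted from the linked answer (the exact implementation there may differ), looks like:

import numpy as np

def numpy_fillna(rows):
    # rows: a list (or object array) of 1-d arrays with differing lengths
    flat = np.concatenate(rows)
    lens = np.array([len(r) for r in rows])

    # boolean mask of the valid positions in a (n_rows, max_len) grid
    mask = lens[:, None] > np.arange(lens.max())

    # zero-filled output; scatter the ragged values into the valid slots
    out = np.zeros(mask.shape, dtype=flat.dtype)
    out[mask] = flat
    return out

print(numpy_fillna(np.split(np.arange(20), [4, 10, 15])).sum(axis=1))  # [ 6 39 60 85]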

Once you have a multidimensional array, the numpy code is fast. But if you start with a list, even a list of arrays, Python list methods are often faster. The time it takes to construct an array (or DataFrame) is never trivial.

2 Comments

Yes, I'm aware that each array may differ in length. np.split here was just to make the example easy to understand. Just wondering if there was an alternative to Python-level iteration.
The core of true vectorization is to perform the action in compiled code, using the standard numpy building blocks. In most cases that requires rectangular multidimensional arrays.
