
Say I have the following numpy array:

a = np.arange(20)

And also an array containing indices as follows:

ix = np.array([4,10,15])

I've been trying to come up with a vectorized solution to the following question: how can I apply a function to each chunk of a obtained by splitting it at the indices in ix?

So say I were to split a with np.split (I'm only using np.split here to illustrate the groups to which I would like to apply a function):

np.split(a,ix)

[array([0, 1, 2, 3]),
 array([4, 5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19])]

And say, for instance, I'd like to take the sum of each chunk, giving:

[6, 39, 60, 85]

How could I vectorize this using numpy?

  • I don't think you can vectorize your ops on that kind of array. Can you use pandas? Commented Feb 5, 2019 at 18:23
  • Well, yes, I could, but I'm looking for a numpy solution for better performance. Commented Feb 5, 2019 at 18:24
  • You are looking for sum only or for a general solution? Commented Feb 5, 2019 at 18:26
  • Take a look at this solution: stackoverflow.com/a/32043366/4001592 Commented Feb 5, 2019 at 18:45
  • It wasn't that one in the case I'm thinking of, but that guy has some serious tricks up his sleeve, so it's a decent bet he has solved something similar. Commented Feb 5, 2019 at 18:47

3 Answers


I do not know if this is the best solution, but you could convert the list of arrays with different sizes into a list of arrays of a fixed size by padding with zeros, and then apply a function like sum that is not affected by the zeros.

See an example below.

a = np.arange(20)
ix = np.array([4,10,15])
b = np.split(a,ix)
print(b)

results in

[array([0, 1, 2, 3]),
 array([4, 5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19])]

Then use itertools to convert the list to an array, as shown here:

import itertools
c = np.array(list(itertools.zip_longest(*b, fillvalue=0))).T
print(c)

which results in

[[ 0  1  2  3  0  0]
 [ 4  5  6  7  8  9]
 [10 11 12 13 14  0]
 [15 16 17 18 19  0]]

then sum it using

np.sum(c, axis = 1)

results in

array([ 6, 39, 60, 85])
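
Putting those steps together, a small self-contained helper (the name grouped_sum is just for illustration, not from the original) might look like this:

import itertools
import numpy as np

def grouped_sum(a, ix):
    # split into chunks, pad the shorter ones with zeros so they stack
    # into a rectangular 2-D array, then sum along each row;
    # the zero padding does not change the sums
    chunks = np.split(a, ix)
    padded = np.array(list(itertools.zip_longest(*chunks, fillvalue=0))).T
    return padded.sum(axis=1)

print(grouped_sum(np.arange(20), np.array([4, 10, 15])))  # [ 6 39 60 85]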

5 Comments

Pretty nice solution even though it's not numpy based, thanks for sharing!
Don't dismiss zip_longest just because it doesn't have a np. prefix! It's one of the nicer ways of padding lists to a uniform length.
Yes, I think this solution is indeed nice! I'm only stressing that fact because I did explicitly mention that I was looking for a numpy solution.
But why do you want a numpy solution?
Because I would be applying the solution to n arrays like a and ix. So ideally I would like to stack them together and perform vectorized operations over them. I start off with a 3-dim array, perform some operations, and end up with these 2-dim arrays a and ix. I made them 1-dim arrays here for simplicity, but it should be the same if the solution is numpy based. If Python loops are used, it won't scale as well.

A pandas solution will be:

import numpy as np
import pandas as pd

a = np.arange(20)

ix = np.array([4, 10, 15])

data = pd.DataFrame(np.split(a, ix))

print(np.nansum(data, axis=1))

Output

[ 6. 39. 60. 85.]
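
Since the NaN padding promotes the result to float64, a small follow-up (my own tweak, not part of the original snippet) gets integer sums back:

# continuing from the snippet above
print(np.nansum(data, axis=1).astype(int))  # [ 6 39 60 85]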

1 Comment

This works because DataFrame pads the short rows with NaN. But creating the frame is a very slow process. My list comprehension is 30x faster.

split produces a list of arrays, which may differ in length. It actually does so iteratively:

In [12]: alist = []
In [13]: alist.append(a[0:ix[0]])
In [14]: alist.append(a[ix[0]:ix[1]])
In [15]: alist.append(a[ix[1]:ix[2]])
In [16]: alist.append(a[ix[2]:])

Applying sum to each element of the list individually makes sense:

In [11]: [np.sum(row) for row in alist]
Out[11]: [6, 39, 60, 85]

When you have a list of arrays that differ in shape, it's a good bet that you'll have to do a Python-level iteration on it.

Fast 'vectorized' code means performing the calculations in compiled code. Most of that is built around multidimensional arrays, e.g. 2-d ones. If your split had produced equal-size arrays, you could use np.sum with the appropriate axis parameter.

In [23]: a1 = a.reshape(4,5)
In [24]: a1
Out[24]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
In [25]: np.sum(a1, axis=1)
Out[25]: array([10, 35, 60, 85])

Sometimes we can play tricks to cast the problem into an n-d one, for example by padding the arrays of the split with zeros. But that casting itself might require iteration.

As raised here (and its links), Origin of AttributeError: object has no attribute 'cos', math (ufunc) functions applied to object-dtype arrays end up delegating the action to the corresponding methods of the objects. But that still involves a (near) Python-level iteration over the objects.
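
As a rough sketch of that object-dtype route (built explicitly here, since newer numpy versions refuse to infer an object array from a ragged list):

import numpy as np

a = np.arange(20)
ix = np.array([4, 10, 15])
alist = np.split(a, ix)

# hold the ragged chunks in an object-dtype array
obj = np.empty(len(alist), dtype=object)
obj[:] = alist

# frompyfunc maps the Python-level sum over each chunk in turn
per_chunk_sum = np.frompyfunc(sum, 1, 1)
print(per_chunk_sum(obj))   # [6 39 60 85], object dtype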


Some timings:

In [57]: timeit [np.sum(row) for row in alist]
31.7 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [58]: timeit np.sum(list(itertools.zip_longest(*alist, fillvalue=0)),axis=0)
25.2 µs ± 82 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [59]: timeit np.nansum(pd.DataFrame(alist), axis=1)
908 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [61]: timeit np.frompyfunc(sum,1,1)(alist)
12.9 µs ± 21.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In this last case the Python sum is faster than np.sum. But that's true with the list comprehension as well:

In [63]: timeit [sum(row) for row in alist]
6.86 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And with Divakar's wiz-bang fillna from Numpy: Fix array with rows of different lengths by filling the empty elements with zeros:

In [70]: timeit numpy_fillna(np.array(alist)).sum(axis=1)
44.2 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
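
numpy_fillna isn't reproduced above; a sketch of such a mask-based fill, adapted from the linked answer (the exact implementation there may differ), looks like:

import numpy as np

def numpy_fillna(rows):
    # rows: a list (or object array) of 1-d arrays with differing lengths
    flat = np.concatenate(rows)
    lens = np.array([len(r) for r in rows])

    # boolean mask of the valid positions in a (n_rows, max_len) grid
    mask = lens[:, None] > np.arange(lens.max())

    # zero-filled output; scatter the ragged values into the valid slots
    out = np.zeros(mask.shape, dtype=flat.dtype)
    out[mask] = flat
    return out

print(numpy_fillna(np.split(np.arange(20), [4, 10, 15])).sum(axis=1))  # [ 6 39 60 85]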

Once you have a multidimensional array, the numpy code is fast. But if you start with a list, even a list of arrays, Python list methods are often faster. The time it takes to construct an array (or DataFrame) is never trivial.

2 Comments

Yes, I'm aware that each array may differ in length. np.split here was just to make the example easy to understand. Just wondering if there was an alternative to Python-level iteration.
The core of true vectorization is to perform the action in compiled code, using the standard numpy building blocks. In most cases that requires rectangular multidimensional arrays.
