
I have a structure like this:

a = np.array([[np.array([1,2,3]), np.array([2])], [np.array([0,1,2,3,4]), np.array([0,4])]])

It is a 2x2 structure, and every element can be an array of arbitrary length.

I want to get the mean of every element of a:

np.mean(a)

ValueError: operands could not be broadcast together with shapes (3) (5) 

I tried to play with axis, but it doesn't work. I want to get a new np.array equal to [[2, 2], [2, 2]].

In general I want to be able to run any vectorized function on a in the same way.

How can I do this? I need fast code, so please avoid explicit loops.

The best I can do is:

f = np.mean
result = np.zeros(a.shape)
for i in np.ndindex(a.shape):
  result[i] = f(a[i])
3 Comments
    What exactly do you mean by "fast code"? If a is always going to be 2x2, but the elements of it can be huge, doing a slow outer 2x2 loop doesn't make any difference, as long as you're still doing fast iteration over the inner loops (which np.mean does). Commented Feb 27, 2013 at 1:26
  • fast means "as fast as possible". a is an N x M x ... matrix with N,M,... ~ 10, dimensions should be <~ 4. Every element can be quite big. Probably I should try to parallelize the evaluation, using a thread for every matrix element. Commented Feb 27, 2013 at 10:48
  • But you don't make something as fast as possible by optimizing something that takes less than 2% of the total time. Commented Feb 27, 2013 at 18:49

3 Answers


I guess you want numpy.vectorize, which is basically like the builtin map for ndarrays.

>>> a = np.array([[np.array([1,2,3]), np.array([2])], [np.array([0,1,2,3,4]), np.array([0,4])]])
>>> vmean = np.vectorize(np.mean)
>>> vmean(a)
array([[ 2.,  2.],
       [ 2.,  2.]])
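
Since vectorize just applies the function element by element, the same pattern should work for any function that reduces an array to a scalar; np.max here is only an illustration:

>>> vmax = np.vectorize(np.max)
>>> vmax(a)
array([[3, 2],
       [4, 4]])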

6 Comments

Nice! By the way, you'll have to explain that to me a bit more.
The problem with vectorize is that it gives the illusion of numpy efficiency, but it performs like a pure Python loop.
@Jaime: +1 for pointing that out. It's still useful for simplicity—it's nicer than, e.g., getting an iterator to an array to pass to map and to pass to an array constructor. But it's not going to save any time.
@abanert I disagree on this one: I have a very strong dislike of vectorize since its poor performance broke my heart when I was starting out with numpy. Not sure I will ever get over it!
@Jaime: I'd still rather see code that uses vectorize with a big # WARNING: This is no faster than the equivalent loop over ravel, it's just easier to read comment than the explicit loop. It's still shorter even with that comment, and, more importantly, more readable and harder to get wrong. But I can understand your aversion.

This is the best I can do:

a = np.array([[np.array([1,2,3]), np.array([2])], [np.array([0,1,2,3,4]), np.array([0,4])]])
np.array([ np.mean(b) for b in a.ravel()]).reshape(a.shape)

outputs

array([[ 2.,  2.],
       [ 2.,  2.]])

Can be generalized for other functions as:

def ravel_func(func,arr):
    return np.asarray([ func(part) for part in arr.ravel()]).reshape(arr.shape)

Speed test (thanks to Jaime):

In [82]: timeit vmean(a)
10000 loops, best of 3: 65 us per loop

In [83]: timeit ravel_func(np.mean,a)
10000 loops, best of 3: 67 us per loop

2 Comments

You are using a list comprehension, which is not very different from an explicit loop. I would like to use some numpy feature.
Nicer, yes, but same speed.

From your comment: you have a relatively small number of quite big elements. That means the speed of the outer loop's iteration is irrelevant, while the speed of the inner loops' iterations is crucial.

Let's put some actual numbers on this. Your outer array has up to 4 dimensions of up to size 10, which means up to 10000 elements. Meanwhile, the elements are "quite big"; let's interpret that conservatively as just 50 each. That gives you 510000 loop iterations in total. Anything you do to improve the speed of the 10000 outer iterations will make less than a 2% difference to your code. In fact it's far less than that, because the 2% assumes there is no work to be done other than the iteration itself, which obviously isn't true.

So, you're focusing on the wrong place. Only the 500000 inner iterations matter. If you could replace the array of arrays with a single one-dimension-higher array and do it all in numpy, it might be faster, but making your code much more complex and harder to understand for a gain on the order of a fraction of 1% is silly. Just use the vectorize answer or the comprehension answer, which are simple and obvious.
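
Just to illustrate what that one-dimension-higher layout would mean (it only applies if the inner arrays can all be made the same length, which isn't true for your data):

# Hypothetical: if every inner array had the same length (here 3), a single
# 2x2x3 array plus a reduction over the last axis would replace the object array.
b = np.array([[[1, 2, 3], [2, 2, 2]],
              [[0, 2, 4], [0, 4, 2]]])
b.mean(axis=-1)   # array([[ 2.,  2.], [ 2.,  2.]])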


Meanwhile:

Probably I should try to parallelize the evaluation, using a thread for every matrix element.

Parallelism is a good idea here. But not using threads, and not one per element.

If you have, say, an 8-core machine, using 8 worker threads means you can get things done nearly 8 times as fast. But using 10000 threads does not mean you get things done 10000 times as fast, or even 8 times as fast; you probably spend so much time context-switching, allocating and freeing thread stacks, etc. that it actually takes longer than the sequential version. So, you want to create a worker pool of 8 threads and stick your 10000 tasks in a queue to be run by that pool.

Also, 10000 tasks is probably way too granular. Each job has a little bit of overhead, and if your jobs are too small, you're wasting too much time on that overhead. The obvious way to batch things up is per axis: have 1000 jobs, each doing one row of 10 elements, or 100 jobs, each doing one 2D array of 100 elements. Test out batch sizes of 0, 1, 2, and 3 axes and see which gives the best performance. If the differences are huge, you may want to try slightly more complex batching, like splitting into 3D arrays of 3x10x10 and 4x10x10 (see the sketch below).
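
As a very rough sketch of what per-axis batching could look like (big and executor are made-up names here: a 10x10x10x10 object array and an already-created ProcessPoolExecutor), you would submit one task per last-axis row instead of one per element:

def mean_of_row(row):
    # row is a 1-D object array of 10 inner arrays; process the whole row in one task
    return [np.mean(x) for x in row]

# 1000 jobs of 10 elements each instead of 10000 jobs of 1 element each
future_map = {executor.submit(mean_of_row, big[i]): i
              for i in np.ndindex(big.shape[:-1])}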

Finally, while Python threads are real OS (and therefore hardware) threads, the GIL prevents more than one of them from running Python code at a time. numpy does some work to get around this, but it's not perfect (especially if you have a lot of overhead between the numpy calls). The simplest way around this is to use multiprocessing, which lets you have a pool of 8 separate processes, instead of 8 threads in 1 process. This ramps up the overhead significantly—you either need to copy the sub-arrays (either implicitly, as the parameters and return values of your task functions, or explicitly, over a pipe) or put them in shared memory. But if your task sizes are large enough, it's usually not a problem.


Here's an example of how to parallelize your existing code:

from concurrent import futures  # stdlib in 3.2+; the "futures" backport from PyPI on 2.7

f = np.mean
result = np.zeros(a.shape)
future_map = {}
with futures.ProcessPoolExecutor(max_workers=8) as executor:
    # submit one task per element, remembering which index each future belongs to
    for i in np.ndindex(a.shape):
        future_map[executor.submit(f, a[i])] = i
    # collect the results as they complete and drop them back into place
    for future in futures.as_completed(future_map):
        i = future_map[future]
        result[i] = future.result()

Obviously, you could simplify it (e.g., replace the submission loop with a dict comprehension), but I wanted to make it as obvious as possible what you need to change.
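
For what it's worth, the simplified version (same imports and setup as above, just collapsing the submission loop into a dict comprehension) might look roughly like this:

f = np.mean
result = np.zeros(a.shape)
with futures.ProcessPoolExecutor(max_workers=8) as executor:
    future_map = {executor.submit(f, a[i]): i for i in np.ndindex(a.shape)}
    for future in futures.as_completed(future_map):
        result[future_map[future]] = future.result()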

Also, I'm using futures (which requires 3.2+; if you're on 2.7, install the backport from PyPI) because it makes the code a little simpler; if you need more flexibility, you want the more complex multiprocessing library.

And finally, I'm not doing any batching—each task is a single element—so the overhead is probably going to be pretty bad.

But start with this, simplify it as far as you're comfortable with, then convert it to use batches of 1, 2, and 3 axes, etc. as described earlier.
