
I am trying to optimize the following loop:

def numpy(nx, nz, c, rho):
    # a and b are module-level arrays defined below; c is a length-4 weight vector
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum(c*rho[ix-1:ix+3, iz])
            b[ix, iz] = sum(c*rho[ix-2:ix+2, iz])
    return a, b

I tried different solutions and found that using numba to calculate the sum of the product leads to better performance:

import numpy as np
import numba as nb
import time

@nb.autojit
def sum_opt(arr1, arr2):
    # dot product of two 1D arrays, written as an explicit loop for numba
    s = arr1[0]*arr2[0]
    for i in range(1, len(arr1)):
        s += arr1[i]*arr2[i]
    return s

def numba1(nx, nz, c, rho):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.autojit
def numba2(nx, nz, c, rho):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b
nx = 1024
nz = 256    

rho = np.random.rand(nx, nz)
c = np.random.rand(4)
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

ti = time.clock()
a, b = numpy(nx, nz, c, rho)
print 'Time numpy  : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
a, b = numba1(nx, nz, c, rho)
print 'Time numba1 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
a, b = numba2(nx, nz, c, rho)
print 'Time numba2 : ' + str(round(time.clock() - ti, 4))

This leads to

Time numpy : 4.1595

Time numba1 : 0.6993

Time numba2 : 1.0135

Using the numba version of the sum function (sum_opt) performs very well. But I am wondering why the numba version of the double-loop function (numba2) leads to slower execution times. I tried using jit instead of autojit and specifying the argument types, but it was worse.

I also noticed that looping first on the smallest dimension is slower than looping first on the biggest one. Is there any explanation?

Whatever the explanation, I am sure this double-loop function can be improved a lot by vectorizing the problem (like this) or by using another method (map?), but I am a little confused about these methods.

In the other parts of my code, I used numba and numpy slicing to replace all explicit loops, but in this particular case I don't know how to set it up.

Any ideas?

EDIT

Thanks for all your comments. I worked a little on this problem:

import numba as nb
import numpy as np
from scipy import signal
import time


@nb.jit(['float64(float64[:], float64[:])'], nopython=True)
def sum_opt(arr1, arr2):
    s = arr1[0]*arr2[0]
    for i in xrange(1, len(arr1)):
        s += arr1[i]*arr2[i]
    return s

@nb.autojit
def numba1(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b


@nb.jit(nopython=True)
def numba2(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
def numba3a(nx, nz, c, rho, a):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
    return a

@nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
def numba3b(nx, nz, c, rho, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return b

def convol(nx, nz, c, aa, bb):
    # rho is read from the module-level scope
    s1 = rho[1:nx-1,2:nz-3]
    s2 = rho[0:nx-2,2:nz-3]
    kernel = c[:,None][::-1]
    aa[2:nx-3,2:nz-3] = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
    bb[2:nx-3,2:nz-3] = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')
    return aa, bb


nx = 1024
nz = 256 
rho = np.random.rand(nx, nz)
c = np.random.rand(4)
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

ti = time.clock()
for i in range(1000):
    a, b = numba1(nx, nz, c, rho, a, b)
print 'Time numba1 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
for i in range(1000):
    a, b = numba2(nx, nz, c, rho, a, b)
print 'Time numba2 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
for i in range(1000):
    a = numba3a(nx, nz, c, rho, a)
    b = numba3b(nx, nz, c, rho, b)
print 'Time numba3 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
for i in range(1000):
    a, b = convol(nx, nz, c, a, b)
print 'Time convol : ' + str(round(time.clock() - ti, 4))

Your solution is very elegant, Divakar, but I have to call this function a large number of times in my code. So, for 1000 iterations, this leads to

Time numba1 : 3.2487

Time numba2 : 3.7012

Time numba3 : 3.2088

Time convol : 22.7696

autojit and jit are very close. However, when using jit, it seems important to specify all argument types.

I do not know if there is a way to specify argument types in the jit decorator when the function has multiple outputs. Does anyone?

For now, I have not found a better solution than using numba. New ideas are welcome!

Comments

  • Write an iterator to do the job; it's fast, efficient, and consumes less memory than your double loops. (May 13, 2015)

4 Answers


You are basically performing 2D convolution there, with the small modification that your kernel is not reversed, as it would be in the usual convolution operation. So there are two things we need to do here to use signal.convolve2d for our case -

  • Slice the input array rho to select the portion of it that is used in the original loopy version of your code. This will be the input data to the convolution.
  • Reverse the kernel c and feed it along with the sliced data to signal.convolve2d.

Please note that this has to be done separately for both a and b.

Here's the implementation -

import numpy as np
from scipy import signal

# Slices for convolutions to get a and b respectively (nx, nz, c, rho as in the question)
s1 = rho[1:nx-1,2:nz-3]
s2 = rho[0:nx-2,2:nz-3]
kernel = c[:,None][::-1]  # convolution kernel

# Setup output arrays and fill them with convolution results
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

a[2:nx-3,2:nz-3] = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
b[2:nx-3,2:nz-3] = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')

If you don't need the extra zeros around the boundaries of the output arrays, you could simply use the outputs of signal.convolve2d as they are, which should boost performance further.
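As a minimal sketch of that variant, continuing from the snippet above (a_core and b_core are illustrative names, not from the original answer):

a_core = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
b_core = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')
# a_core corresponds to a[2:nx-3, 2:nz-3] in the padded version above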

Runtime tests

In [532]: %timeit loop_based(nx, nz, c, rho)
1 loops, best of 3: 1.52 s per loop

In [533]: %timeit numba1(nx, nz, c, rho)
1 loops, best of 3: 282 ms per loop

In [534]: %timeit numba2(nx, nz, c, rho)
1 loops, best of 3: 509 ms per loop

In [535]: %timeit conv_based(nx, nz, c, rho)
10 loops, best of 3: 15.5 ms per loop

So, for the actual input size, the proposed convolution-based approach is about 100x faster than the loopy code and about 20x faster than the fastest numba-based approach, numba1.



You are not taking full advantage of numpy's capabilities. The numpythonic way of handling your problem would be something like:

# Note: this assumes c is a scalar; see the comments below
cs = np.zeros((nx+1, nz))
np.cumsum(c*rho, axis=0, out=cs[1:])
aa = cs[5:, 2:-3] - cs[1:-4, 2:-3]
bb = cs[4:-1, 2:-3] - cs[:-5, 2:-3]

aa will now hold the central, non-zero part of your a array:

>>> a[:5, :5]
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  2.31296595,  2.15743042,  2.5853117 ],
       [ 0.        ,  0.        ,  2.02697233,  2.83191016,  2.58819583],
       [ 0.        ,  0.        ,  2.4086584 ,  2.45175615,  2.19628507]])
>>> aa[:3, :3]
array([[ 2.31296595,  2.15743042,  2.5853117 ],
       [ 2.02697233,  2.83191016,  2.58819583],
       [ 2.4086584 ,  2.45175615,  2.19628507]])

and similarly for bb and b.

On my system, with your sample input, this code runs over 300x faster than your numpy function. According to your timings, that is going to be one or two orders of magnitude faster than numba.
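For reference, here is a minimal 1D sketch of the cumulative-sum trick this answer builds on (window sums in O(n), independent of the window length); as the comments below note, it only applies directly when the weights are constant. Variable names are illustrative:

import numpy as np

x = np.random.rand(10)
w = 4                        # window length
cs = np.zeros(len(x) + 1)
np.cumsum(x, out=cs[1:])     # cs[i] == x[:i].sum()
win = cs[w:] - cs[:-w]       # win[i] == x[i:i+w].sum()
assert np.allclose(win[0], x[:w].sum())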

Comments

  • How do you multiply c and rho? Their dimensions are not equal: (4,) and (1024, 256).
  • Oops... I misread your code and figured it was a constant. That's going to make things slower, but it is still doable; I'll try to rework my solution...

As stated in the performance section of Continuum's blog, autojit compiles just in time, while jit compiles ahead of time:

Numba can compile just-in-time with the autojit decorator, or ahead of time with the jit decorator

This means that in many cases autojit allows the compiler to make a more educated guess about the code it compiles and to optimize accordingly. I know "just-in-time compiling ahead of time" sounds contradictory, but hey.
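A hedged sketch of the two decoration styles (function names are made up; exact behavior depends on the numba version, but in the versions discussed here autojit defers type inference to call time, while a jit signature fixes the types up front):

import numba as nb

@nb.autojit                  # lazy: types inferred from the first call
def scale_lazy(x):
    return x * 2.0

@nb.jit('float64(float64)')  # eager: signature fixed at decoration time
def scale_eager(x):
    return x * 2.0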

But I am wondering why the numba version of the double loop function (numba2) leads to slower execution times

Numba doesn't increase the performance of arbitrary function calls. While I can't say for sure, my guess is that the overhead of the JIT compilation outweighs the benefit from doing it (if there's any benefit at all).

I also noticed that looping first on the smallest dimension is slower than looping first on the biggest one. Is there any explanation?

This is likely due to cache misses. A 2-dimensional array is allocated as a contiguous chunk of memory of size rows * columns. What is fetched into the cache is based on a combination of temporal locality (what has recently been used) and spatial locality (what is close in memory to what has been used), i.e. what is believed to be needed next.

When iterating over rows first, you iterate in the order the data appears in memory. When iterating over columns first, you "skip" the width of a row in memory each time, making it less likely that the memory location you're accessing has already been fetched into the cache.

2D array: [[1,2,3],[4,5,6],[7,8,9]]
In memory:  1 2 3   4 5 6   7 8 9

Let's assume an over-simplified, naive cache-fetch algorithm that fetches 3 consecutive memory locations at a time.

Iterating row-first:

In memory:    1    2    3  |  4    5    6  |  7    8    9
Accessed:     1    2    3  |  4    5    6  |  7    8    9
Cache miss:   -    -    -  |  -    -    -  |  -    -    -

Iterating column-first:

In memory:    1    2    3  |  4    5    6  |  7    8    9
Accessed:     1    4    7  |  2    5    8  |  3    6    9
Cache miss:   -    -    -  |  x    x    x  |  x    x    x
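If you want to observe the effect directly, here is an illustrative sketch (not from the original answer; exact ratios depend on the machine):

import numpy as np
import timeit

x = np.random.rand(4096, 4096)   # C order: rows are contiguous in memory

row_first = timeit.timeit(lambda: [row.sum() for row in x], number=3)
col_first = timeit.timeit(lambda: [col.sum() for col in x.T], number=3)
print(row_first)   # fast: sequential memory access
print(col_first)   # slower: each column strides across the whole row width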

Comments

  • I feel like you're saying that autojit works like HotSpot in Java, where it is decided at runtime which functions are worth compiling. Whereas it looks like autojit creates a compiled function for each unique set of argument types it gets called with. That is, the key difference is that autojit compiles in all instances, whereas HotSpot only compiles frequently used functions.

Numba is very fast in nopython mode but with your code it has to fall back to object mode, which is a lot slower. You can see this happening if you pass nopython=True to the jit decorator.

It does compile in nopython mode (at least in Numba version 0.18.2) if you pass a and b as arguments:

import numba as nb

@nb.jit(nopython=True)
def sum_opt(arr1, arr2):
    s = arr1[0]*arr2[0]
    for i in range(1, len(arr1)):
        s+=arr1[i]*arr2[i]
    return s

@nb.jit(nopython=True)
def numba2(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

Note that in the release notes there is mention of autojit being deprecated in favor of jit.
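For completeness, a usage sketch with the question's own setup (the caller preallocates a and b and passes them in):

import numpy as np

nx, nz = 1024, 256
rho = np.random.rand(nx, nz)
c = np.random.rand(4)
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))
a, b = numba2(nx, nz, c, rho, a, b)   # compiles in nopython mode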


Apparently you're not satisfied yet. So how about a solution based on stride_tricks?

import numpy as np
from numpy.lib.stride_tricks import as_strided

def stridetrick_einsum(c, rho, out):
    ws = len(c)  # window size
    nx, nz = rho.shape

    # Build a (nx-ws+1, ws, nz) view: rho_windowed[i, j, k] aliases rho[i+j, k]
    shape = (nx-ws+1, ws, nz)
    strides = (rho.strides[0],) + rho.strides
    rho_windowed = as_strided(rho, shape, strides)

    # Weighted sum over each window, written directly into out
    np.einsum('j,ijk->ik', c, rho_windowed, out=out)

a = np.zeros((nx, nz))
stridetrick_einsum(c, rho[1:-1,2:-3], a[2:-3,2:-3])
b = np.zeros((nx, nz))
stridetrick_einsum(c, rho[0:-2,2:-3], b[2:-3,2:-3])
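To make the as_strided construction concrete, here is a small illustrative check on toy data (names are made up): the windowed view aliases the original array without copying, so windows[i] is exactly the length-ws slab starting at row i.

import numpy as np
from numpy.lib.stride_tricks import as_strided

toy = np.arange(12.0).reshape(6, 2)
ws = 3
shape = (toy.shape[0] - ws + 1, ws, toy.shape[1])
strides = (toy.strides[0],) + toy.strides
windows = as_strided(toy, shape, strides)
# windows[i, j, k] == toy[i + j, k]
assert np.allclose(windows[1], toy[1:4])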

What's more, since a and b are obviously almost exactly the same, you can calculate them in one go and then copy the values over:

a = np.zeros((nx, nz))
stridetrick_einsum(c, rho[:-1,2:-3], a[1:-3,2:-3])
b = np.zeros((nx, nz))
b[2:-3,2:-3] = a[1:-4,2:-3]
a[1,2:-3] = 0.0

Comments

  • Thanks, I didn't know about strides! It works well and is faster than the numba implementation, but I am a little confused about this formulation. Reading the docs helped me understand the 2D case, but it is still obscure to me here. For example, if I have to make the same calculation with rho[ix, iz-1:iz+3], I guess I have to change shape to (nx, ws, nz-ws+1), but do strides and rho_windowed remain the same? And for 'j,ijk->ik' in einsum, what changes do I have to make?
  • @IpseLium, yeah, I understand your confusion. You're right about the new shape, though. Indeed, you would also need to change the strides and the einsum call. But it would be a lot easier to keep the function the same as in my answer and call it like stridetrick_einsum(c, rho[2:-3,1:-1].T, a[2:-3,2:-3].T). I cannot test this right now, but it should work.
