
A lot of numpy functions provide the option to operate over a certain axis with the axis= parameter. My questions are:

  1. How are these 'along axis' operations implemented? Or, as a more direct question:
  2. How do I efficiently write my own function providing a similar option?

I notice numpy provides a function numpy.apply_along_axis that will serve as an answer if the base function's input is a 1-D array.
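For instance, a minimal example of the 1-D case:

import numpy as np

A = np.arange(24).reshape(2, 3, 4)
# Each 1-D slice A[i, j, :] is passed to np.mean in turn.
print(np.apply_along_axis(np.mean, 2, A).shape)  # (2, 3)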

But what if my base function needs a multi-dimensional input? E.g. finding the 2-D moving average B of a numpy array A of shape (5,6,2,3,4) along the first two dimensions (5,6), like a generic function B = f_moving_mean(A, axis=(0,1)).

My current solution is to use numpy.swapaxes and numpy.reshape to accomplish this. Sample code for a 1-D moving-average function:

import pandas as pd
import numpy as np

def nanmoving_mean(data, window, axis=0):
    kw = {'center': True, 'window': window, 'min_periods': 1}
    if data.ndim == 1:
        return pd.Series(data).rolling(**kw).mean().to_numpy()
    elif data.ndim >= 2:
        # Bring the target axis to the front, collapse the remaining
        # dimensions into columns, roll down the rows, then undo both steps.
        tmp = np.swapaxes(data, 0, axis)
        tmpshp = tmp.shape
        tmp = np.reshape(tmp, (tmpshp[0], -1), order='C')
        tmp = pd.DataFrame(tmp).rolling(**kw).mean().to_numpy()
        tmp = np.reshape(tmp, tmpshp, order='C')
        return np.swapaxes(tmp, 0, axis)
    else:
        print('Invalid dimension!')
        return None

data = np.random.randint(10, size=(2,3,6))
print(data)
print(nanmoving_mean(data, window=3, axis=2))

Is this a common / efficient way of implementing question 2? Any improvements / suggestions / new methods are welcome.

PS. The reason I'm involving pandas here is that its rolling(...).mean() method deals with NaN data properly.

Edit: I guess another way of asking the question is: what's the syntax for looping over a 'dynamic' number of dimensions?
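One sketch of a generic pattern (the helper name f_over_axes is made up for illustration, and it assumes the base function maps a 2-D array to a same-shape 2-D array): move the target axes to the front with np.moveaxis, collapse everything else into one trailing axis, and loop over that single axis regardless of the original number of dimensions.

import numpy as np

def f_over_axes(func2d, data, axes=(0, 1)):
    # Bring the two target axes to the front.
    work = np.moveaxis(data, axes, (0, 1))
    moved_shape = work.shape
    # Collapse all remaining dimensions into one trailing axis, so the
    # loop runs over a single index however many dimensions came in.
    work = work.reshape(moved_shape[0], moved_shape[1], -1)
    out = np.empty_like(work)
    for k in range(work.shape[-1]):
        out[:, :, k] = func2d(work[:, :, k])
    # Undo the reshape and the axis move.
    return np.moveaxis(out.reshape(moved_shape), (0, 1), axes)

# e.g. demean each 2-D slice of a 5-D array over axes (0, 1):
# B = f_over_axes(lambda m: m - m.mean(), np.random.rand(5, 6, 2, 3, 4))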


2 Answers


We could take an approach using 2D convolution.

The basic steps would be (a tiny 1-D sketch of these steps follows the list):

  • As a pre-processing step, replace NaNs with 0s, as we need to do windowed summation on the input data.
  • Get the windowed summations with Scipy's convolve2d for the data values and also for the mask of NaNs. We will use boundary elements as zeros.
  • Subtract the windowed count of NaNs from the window size to get the count of valid elements responsible for each summation.
  • For the boundary elements, we would have gradually fewer elements accounting for the summations.
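Here is that 1-D sketch of the steps (using np.convolve purely for illustration; the actual approaches below use 2D convolution and the uniform filter, and compute the per-position window sizes analytically):

import numpy as np

x = np.array([1., np.nan, 3., 4., np.nan])
W = 3
mask = np.isnan(x)
x0 = np.where(mask, 0, x)                                    # NaNs -> 0
sums = np.convolve(x0, np.ones(W), 'same')                   # windowed value sums
nans = np.convolve(mask.astype(float), np.ones(W), 'same')   # windowed NaN counts
full = np.convolve(np.ones_like(x), np.ones(W), 'same')      # effective window sizes
print(sums / (full - nans))                                  # [1.  2.  3.5 3.5 4. ]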

Now, these windowed summations could also be obtained with Scipy's 1D uniform filter (uniform_filter1d), which is comparatively more efficient. Another benefit is that we can specify the axis along which these summations/averagings are to be performed.

With a mix of Scipy's 2D convolution and 1D uniform filter, we have a few approaches, as listed next.

Import relevant Scipy functions -

from scipy.signal import convolve2d as conv2
from scipy.ndimage import uniform_filter1d as uniff

Approach #1 :

def nanmoving_mean_numpy(data, W):  # data: input array, W: window size
    N = data.shape[-1]
    hW = (W-1)//2  # half window size

    nan_mask = np.isnan(data)
    data1 = np.where(nan_mask, 0, data)  # NaNs -> 0 for summation

    # Windowed sums of the values and of the NaN counts, with zero-padded
    # boundaries; collapse to 2D for convolve2d.
    value_sums = conv2(data1.reshape(-1,N), np.ones((1,W)), 'same', boundary='fill')
    nan_sums = conv2(nan_mask.reshape(-1,N).astype(float), np.ones((1,W)), 'same', boundary='fill')

    value_sums.shape = data.shape
    nan_sums.shape = data.shape

    # Effective window size at each position (shrinks at the edges).
    b_sizes = hW+1+np.arange(hW)  # boundary sizes
    count = np.hstack(( b_sizes, W*np.ones(N-2*hW), b_sizes[::-1] ))
    return value_sums/(count - nan_sums)

Approach #2 :

def nanmoving_mean_numpy_v2(data, W):  # data: input array, W: window size
    N = data.shape[-1]
    hW = (W-1)//2  # half window size

    nan_mask = np.isnan(data)
    data1 = np.where(nan_mask, 0, data)

    # Windowed sums via the 1D uniform filter (mean * W); NaN counts
    # still via 2D convolution.
    value_sums = uniff(data1, size=W, axis=-1, mode='constant')*W
    nan_sums = conv2(nan_mask.reshape(-1,N).astype(float), np.ones((1,W)), 'same', boundary='fill')
    nan_sums.shape = data.shape

    b_sizes = hW+1+np.arange(hW)  # boundary sizes
    count = np.hstack(( b_sizes, W*np.ones(N-2*hW,dtype=int), b_sizes[::-1] ))
    out = value_sums/(count - nan_sums)
    # Positions whose windows were all-NaN stay NaN.
    out = np.where(np.isclose(count, nan_sums), np.nan, out)
    return out

Approach #3 :

def nanmoving_mean_numpy_v3(data, W):  # data: input array, W: window size
    N = data.shape[-1]
    hW = (W-1)//2  # half window size

    nan_mask = np.isnan(data)
    data1 = np.where(nan_mask, 0, data)
    # Fraction of NaNs in each window (relative to the full size W).
    nan_avgs = uniff(nan_mask.astype(float), size=W, axis=-1, mode='constant')

    b_sizes = hW+1+np.arange(hW)  # boundary sizes
    count = np.hstack(( b_sizes, W*np.ones(N-2*hW), b_sizes[::-1] ))
    # Fraction of valid elements in each window.
    scale = (count/float(W)) - nan_avgs
    out = uniff(data1, size=W, axis=-1, mode='constant')/scale
    # Positions whose windows were all-NaN stay NaN.
    out = np.where(np.isclose(scale, 0), np.nan, out)
    return out
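A quick sanity check (a sketch; it assumes the pandas-based nanmoving_mean from the question is in scope, and ignores all-NaN windows, where approach #1 divides by zero):

# Small random array with some NaNs sprinkled in
data = np.random.randint(10, size=(4, 5, 12)).astype(float)
data.ravel()[np.random.choice(data.size, 20, replace=False)] = np.nan
W = 5

ref = nanmoving_mean(data, window=W, axis=2)
for f in (nanmoving_mean_numpy, nanmoving_mean_numpy_v2, nanmoving_mean_numpy_v3):
    print(f.__name__, np.allclose(ref, f(data, W), equal_nan=True))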

Runtime test

Dataset #1 :

In [807]: # Create random input array and insert NaNs
     ...: data = np.random.randint(10,size=(20,30,60)).astype(float)
     ...: 
     ...: # Add 10% NaNs across the data randomly
     ...: idx = np.random.choice(data.size,size=int(data.size*0.1),replace=0)
     ...: data.ravel()[idx] = np.nan
     ...: 
     ...: W = 5 # Window size
     ...: 

In [808]: %timeit nanmoving_mean(data,window=W,axis=2)
     ...: %timeit nanmoving_mean_numpy(data, W)
     ...: %timeit nanmoving_mean_numpy_v2(data, W)
     ...: %timeit nanmoving_mean_numpy_v3(data, W)
     ...: 
10 loops, best of 3: 22.3 ms per loop
100 loops, best of 3: 3.31 ms per loop
100 loops, best of 3: 2.99 ms per loop
1000 loops, best of 3: 1.76 ms per loop

Dataset #2 [Bigger dataset] :

In [811]: # Create random input array and insert NaNs
     ...: data = np.random.randint(10,size=(120,130,160)).astype(float)
     ...: 
     ...: # Add 10% NaNs across the data randomly
     ...: idx = np.random.choice(data.size,size=int(data.size*0.1),replace=0)
     ...: data.ravel()[idx] = np.nan
     ...: 

In [812]: %timeit nanmoving_mean(data,window=W,axis=2)
     ...: %timeit nanmoving_mean_numpy(data, W)
     ...: %timeit nanmoving_mean_numpy_v2(data, W)
     ...: %timeit nanmoving_mean_numpy_v3(data, W)
     ...: 
1 loops, best of 3: 796 ms per loop
1 loops, best of 3: 486 ms per loop
1 loops, best of 3: 275 ms per loop
10 loops, best of 3: 161 ms per loop



Without getting too much into your question, here's the key part of the apply_along_axis function (viewed via IPython):

        res = func1d(arr[tuple(i.tolist())], *args, **kwargs)
        outarr[tuple(ind)] = res

They construct two index objects, i and ind, which are varied. Assuming we specify axis=2, this code does

outarr[i,j,l] = func1d( arr[i,j,:,l], ...)

for all possible values of i, j, and l. So there's a lot of code for a rather basic iterative calculation.

ind = [0]*(nd-1)          # ind is an (nd-1)-element list of indices

i = zeros(nd, 'O')        # i is a 1d object array
i[axis] = slice(None, None)   # with a full `slice` at position `axis`
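A stripped-down paraphrase of that iteration (my own sketch, not the actual numpy source; it assumes a non-negative axis and a func1d that maps a 1-D slice to a same-length 1-D array), using np.ndindex to generate the varying indices:

import numpy as np

def my_apply_along_axis(func1d, axis, arr):
    out = np.empty_like(arr)
    # Shape of every axis except `axis`; np.ndindex walks all the
    # (i, j, ..., l) combinations, which is what `ind` encodes.
    loop_shape = arr.shape[:axis] + arr.shape[axis+1:]
    for idx in np.ndindex(loop_shape):
        # Splice a full slice in at position `axis`, like `i` above.
        full = idx[:axis] + (slice(None),) + idx[axis:]
        out[full] = func1d(arr[full])
    return out

# e.g. my_apply_along_axis(np.sort, 2, np.random.rand(2, 3, 4))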

I'm not familiar with the pandas rolling, but there are a number of numpy rolling-mean questions. scipy.signal.convolve2d might be of use; np.lib.stride_tricks.as_strided is also used.

Your idea of using reshape and swapaxes (or transpose) to reduce the complexity of the dimension space is also good.

(This isn't a solution; rather, it throws out some ideas that come to mind from other 'moving average' questions. It's too late to develop more.)

4 Comments

I don't know where to look at the code of apply_along_axis, but how does it construct i and ind?
@ShichuZhu If you are looking for performance, apply_along_axis won't help.
@Divakar If the base function only deals with 1D, then looping over all the other dimensions is the only way. Unless we revise the base function to include vectorized operations, I guess?
@ShichuZhu No need to loop. There are many vectorized options available to sum elems in windows.
