Increasing Efficiency of Checking for Duplicates -- Python

Question

I'm a researcher working with climate model output using Python to find certain types of storms. I have 8 large numpy arrays (dimensions are 109574 x 52 x 57). These arrays are filled with 1's to signify there was a storm on that day (first dimension is time), 0 for no storm. The other two dimensions are latitude and longitude.

I have to eliminate back-to-back days from these arrays. For example, if there was a storm on day 1 and day 2, I would like to only count 1 storm. If there was a storm Day 1, 2, and 3, I would like to only count 1 and 3 for a total of two storms, Day 1-4 would have 2 storms, and so on. I found the # of storms at the end using np.sum to tally up the 1's in the array along the time axis.

I am running the following code to achieve this, but I am faced with the issue that it is extremely slow. Because I will have to repeat this procedure for other datasets, I was wondering if there was a way to speed this process up to be more efficient. I have my code below, and I would be more than happy to clarify anything.

# If there is a storm that overlaps two two-day periods, only count it once
print("Eliminating doubles...")
for i in range(52):
    for j in range(57):
        print(i,j)
        for k in range(109573):
            if((storms1[k,i,j]) == 1 and (storms1[k+1,i,j] == 1)):
                storms1[k,i,j] = 0
            if((storms2[k,i,j]) == 1 and (storms2[k+1,i,j] == 1)):
                storms2[k,i,j] = 0
            if((storms3[k,i,j]) == 1 and (storms3[k+1,i,j] == 1)):
                storms3[k,i,j] = 0
            if((storms4[k,i,j]) == 1 and (storms4[k+1,i,j] == 1)):
                storms4[k,i,j] = 0
            if((storms5[k,i,j]) == 1 and (storms5[k+1,i,j] == 1)):
                storms5[k,i,j] = 0
            if((storms6[k,i,j]) == 1 and (storms6[k+1,i,j] == 1)):
                storms6[k,i,j] = 0
            if((storms7[k,i,j]) == 1 and (storms7[k+1,i,j] == 1)):
                storms7[k,i,j] = 0
            if((storms8[k,i,j]) == 1 and (storms8[k+1,i,j] == 1)):
                storms8[k,i,j] = 0

Before someone suggests iterating through the arrays with a loop, I changed the variable names to simplify them for the purpose of asking this question.

Thanks for the help.

Paul Panzer · Accepted Answer · 2017-02-13 01:18:28Z

Here is a vectorised function that can replace your innermost loop:

def do(KK):
    # find stretches of ones
    switch_points = np.where(np.diff(np.r_[0, KK, 0]))[0]
    switch_points.shape = -1, 2
    # isolate stretches starting on odd days and create mask
    odd_starters = switch_points[switch_points[:, 0] % 2 == 1, :]
    odd_mask = np.zeros((KK.shape[0] + 1,), dtype=KK.dtype)
    odd_mask[odd_starters] = 1, -1
    odd_mask = np.add.accumulate(odd_mask[:-1])
    # apply global 1,0,1,0,1,0,... mask
    KK[1::2] = 0
    # invert stretches starting on odd days
    KK ^= odd_mask

call it from within the outer pair of loops (i and j):

do(storms1[:, i, j])
do(storms2[:, i, j])
etc.

It will change the arrays in-place.

This should be much faster than looping (the two outer loops don't make a difference).

How it works:

It finds the starting and endpoints of blocks of ones. We know that in each such block every other one has to be zerod. Using a global 1,0,1,0,1,0,... mask the algorithm zeros out every other day.

That produces

the correct result in blocks that start on even days
no change outside blocks
and the complement of the correct pattern in blocks that start on odd days

The last step of the algorithm is to invert these odd-starting blocks.

Benjamin · Accepted Answer · 2017-02-13 05:28:49Z

An example using a 1D array that simulates your first axis. First, find where the groups of 1's start. Next, find the length of each group. Finally, calculate the number of events based on your logic:

import numpy

a = numpy.random.randint(0,2,20)

# Add an initial 0
a1 = numpy.r_[0, a]

# Mark the start of each group of 1's
d1 = numpy.diff(a1) > 0

# Indices of the start of groups of 1's
w1 = numpy.arange(len(d1))[d1]

# Length of each group
cs = numpy.cumsum(a)
c = numpy.diff(numpy.r_[cs[w1], cs[-1]+1])

# Apply the counting logic
storms = c - c//2

print(a)
>>> array([0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1])
print(c)
>>> array([1, 2, 4, 1, 3])
print(storms)
>>> array([1, 1, 2, 1, 2])

You could save more memory than what I show here by reusing variable names after they are no longer needed, etc.

Daniel F · Accepted Answer · 2017-02-13 10:04:26Z

0

So I think you want:

storms_in[:,i,j] = [0,0,1,1,0,1,1,1,0,1,0,1,1,1,1,0]
storms_out[:,i,j]= [0,0,1,0,0,1,0,1,0,1,0,1,0,0,1,0]

This is not what your code sample is doing, but is what you say you want to do in your second paragraph.

To do this, you'd need two steps

def storms_disc(storms):  # put the whole array here, boolean-safe
    z = np.zeros((1,) + storms.shape[1:]) # zero-pads for the ends
    changes = np.r_[storms.astype('int8') ,z] - np.r_[z, storms.astype('int8')]  #find where the weather changes
    changes=((changes[:-1] == 1) | (changes[1:] == -1)).astype('int8') # reduce dimension
    return ((np.r_[changes, z] - np.r_[z, changes])[:-1] == 1).astype(storms.dtype) #find the first of successive changes

This vectorizes the whole process, and you'd only need to call it 8 times. The astype calls are because subtracting booleans causes an error, even though their value is 1 and 0

Testing:

storms=np.random.randint(0,2,90).reshape(10,3,3)
storms.T

array([[[1, 0, 0, 1, 1, 1, 1, 1, 1, 0],
        [0, 0, 1, 1, 0, 1, 1, 0, 0, 1],
        [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]],

       [[0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
        [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]],

       [[0, 1, 0, 1, 0, 1, 1, 0, 0, 0],
        [0, 1, 0, 1, 0, 1, 0, 0, 1, 1],
        [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]]], dtype=int8)

storms_disc(storms).T

array([[[1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
        [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]],

       [[0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]],

       [[0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
        [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
        [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]]], dtype=int8)

edited Feb 13, 2017 at 10:04

answered Feb 13, 2017 at 8:49

Daniel F

14.5k2 gold badges34 silver badges59 bronze badges

3 Comments

Benjamin Over a year ago

Note that you can view (a.view(numpy.bool)) an 8-bit int array as boolean, because Numpy boolean types are also 8 bit. This saves the type conversion.

Daniel F Over a year ago

Not sure where I'd do that. I'm mostly converting to int8 so I can subtract. I suppose I could replace the .astype(storms.dtype) at the end.

Benjamin Over a year ago

The "view" trick works the other way around too... But yes, I read your code a little bit too fast.

Collectives™ on Stack Overflow

Increasing Efficiency of Checking for Duplicates -- Python

3 Answers 3

Comments

Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related