  • I have a NumPy array named data of shape (300, 300, 300, 300)
  • I have a bool NumPy array named mask of shape (300, 300)

At runtime, values in the mask array are updated, and according to the indices of the True entries in mask, I have to sum the corresponding submatrices from data. The following code achieves the expected result:

result = np.zeros((300, 300))
for i in range(300):
    for j in range(300):
        if mask[i, j]:
            result = result + data[i, j]

The above code is, however, very slow!

I am looking toward using NumPy's built-in slicing and indexing methods. Essentially, the idea is as follows:

  1. Use np.where(mask==True) to select the indices of mask that are True
  2. Extract submatrix reducedData of shape (M, N, 300, 300) from data corresponding to the indices from step 1
  3. Sum the reducedData array along axis 0, and axis 1 (np.sum(reducedData, axis=(0, 1)))

I'm not sure how to achieve step 2 above. Any help is appreciated!
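
For reference, a minimal sketch of what those three steps could look like with np.where and integer (fancy) indexing; note that pairing the row and column index arrays gives a single leading axis of length M (the number of True entries) rather than an (M, N, ...) block, so one sum over axis 0 suffices. The intermediate array still costs M × 300 × 300 values of memory:

rows, cols = np.where(mask)          # step 1: indices of the True entries
reducedData = data[rows, cols]       # step 2: shape (M, 300, 300) via fancy indexing
result = reducedData.sum(axis=0)     # step 3: sum the selected submatrices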

3 Comments

  • Is np.sum(data[mask]) what you are looking for? It won't reduce the data to the dimensions you are describing, but you'll get (N, 300, 300) out, which should be sufficient for your task, I guess. Commented Apr 12, 2022 at 14:02
  • @Ali_Sh I do not have familiarity with Numba. Will it help me speed up? The current solution (suggested above) is causing a MemoryError. Commented Apr 12, 2022 at 14:45
  • I have no idea why you got downvoted. This is a pretty textbook good question. You have a clear problem, an example, and everything. Commented Apr 13, 2022 at 2:58

2 Answers


The loops in the question can be accelerated with Numba, depending on the array sizes. So, if we create some example data:

import numpy as np
import numba as nb

rng = np.random.default_rng(85)

number = 80
a = np.zeros((number // 2, number), dtype=bool)
b = np.ones((number // 2, number), dtype=bool)
con = np.concatenate((a, b)).ravel()
rng.shuffle(con)
mask = con.reshape(number, number)
Data = rng.random(number ** 4, dtype=np.float64).reshape(number, number, number, number)
result = np.zeros((number, number))

we can accelerate the code with Numba's parallel JIT:

@nb.njit("boolean[:,::1], float64[:,::1], float64[:,:,:,::1]", parallel=True)
def numba_(mask, result, Data):
    # `number` is read as a compile-time global constant by Numba
    for i in nb.prange(number):
        for j in range(number):
            if mask[i, j]:
                result = result + Data[i, j]
    return result
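
A minimal usage sketch, assuming the example arrays created above; the jitted function returns the reduction, so we capture the return value and check it against the plain Python loop:

out_nb = numba_(mask, np.zeros((number, number)), Data)

# plain-loop reference for a quick correctness check
ref = np.zeros((number, number))
for i in range(number):
    for j in range(number):
        if mask[i, j]:
            ref += Data[i, j]

print(np.allclose(out_nb, ref))   # expected: True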

This was compared on my system in terms of performance. The code ran 3 to 4 times faster for smaller arrays, and with array shape (150, 150, 150, 150) it ran about 2 times faster. Larger arrays could not be tested due to memory limits (in my tests, the memory issue came from creating Data with rng.random).
To work around the memory issues, my earlier SO answer may be helpful. I tried to write code similar to that answer for this problem, but I couldn't complete it after some time working on it (it was time-consuming for me to debug given my workload), and it will need to be corrected where necessary. I put the incomplete code here JUST for inspiration on how such chunks could be used:

@nb.njit("boolean[:,::1], float64[:,::1], float64[:,:,:,::1]", parallel=True)
def numba_(mask, result, Data):
    chunk_val = 50                        # arbitrary; should be chosen based on the system's RAM size
    chunk = Data.shape[0] // chunk_val
    chunk_res = Data.shape[0] % chunk_val

    for i in range(chunk):
        for j in range(number):
            for k in range(i * chunk_val, (i + 1) * chunk_val):
                if mask[j, k]:
                    min_ind = i * chunk_val
                    max_ind = (i + 1) * chunk_val
                    result[:, min_ind:max_ind] = result[:, min_ind:max_ind] + Data[j, k][:, min_ind:max_ind]
            if i == chunk - 1:
                for k in range((i + 1) * chunk_val, Data.shape[0]):
                    if mask[j, k]:
                        result[:, -chunk_res:] = result[:, -chunk_res:] + Data[j, k][:, -chunk_res:]
    return result

Note that this code must be evaluated for memory consumption once it is completed, but in terms of speed it should behave like my proposed Numba answer above.
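
For comparison, the same chunking idea can be sketched in plain NumPy (not the author's code; a rough illustration that walks the first axis in slices so that only a small boolean-indexed temporary exists at any time):

def masked_sum_chunked(data, mask, chunk_rows=32):
    """Sum data[i, j] over all True mask[i, j], a chunk of rows at a time."""
    out = np.zeros(data.shape[2:], dtype=data.dtype)
    for start in range(0, data.shape[0], chunk_rows):
        sub_mask = mask[start:start + chunk_rows]            # (chunk, N) slice of the mask
        if sub_mask.any():
            # boolean indexing yields (K, 300, 300) with K = sub_mask.sum()
            out += data[start:start + chunk_rows][sub_mask].sum(axis=0)
    return out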

Benchmarks (Core i5, 1st gen, 16 GB RAM):

shape = (70, 70, 70, 70)               (~3x faster)
31247600    ns    --> OP code
0           ns    --> Numba accelerated

shape = (100, 100, 100, 100)           (~2.5x faster)
78147400    ns    --> OP code
31229900    ns    --> Numba accelerated

shape = (150, 150, 150, 150)           (~2x faster)
390626400   ns    --> OP code
218730800   ns    --> Numba accelerated

shape = (170, 170, 170, 170)           (~2x faster)
625012500   ns    --> OP code
328132000   ns    --> Numba accelerated

Comparison with Mad Physicist's np.add.reduce answer:
10 loops, best of 5: 46.4 ms per loop    # this answer (Numba)   (~7x faster)
10 loops, best of 5: 344  ms per loop    # Mad Physicist

6 Comments

But why not just use indexing?
Could you benchmark against np.add.reduceat? I posted an answer.
Also, what's with 0ns?
The second number is 0. That's better than 3x faster...
@MadPhysicist, I have tested your solution with numba on colab, and numba was much faster.

The trick is to use np.add.reduce with a where argument to avoid creating a giant temporary array. For example, on my machine, summing a (300, 300, 100, 100) array (which is the largest I'm able to allocate) takes less than a second this way:

np.add.reduce(data, axis=(0, 1), out=out, where=mask[..., None, None])

The only caveat is that you don't have the luxury of broadcasting the mask: you have to explicitly set the dimensions.
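
As a small illustration of that caveat (the where argument must broadcast against data's full (300, 300, 100, 100) shape; a bare (300, 300) mask would be aligned with the trailing (100, 100) axes and fail to broadcast):

mask[..., None, None].shape   # (300, 300, 1, 1), broadcasts over the last two axes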

As it happens, your loop is not so slow: it runs ~2x faster than the vectorized operation on my machine.

import numpy as np

np.random.seed(435)
data = np.random.uniform(size=(300, 300, 100, 100))
mask = np.random.randint(2, size=(300, 300), dtype=bool)
out = np.zeros((100, 100), float)

result = np.zeros((100, 100))
for i in range(300):
    for j in range(300):
        if mask[i, j]:
            result = result + data[i, j]

%timeit np.add.reduce(data, axis=(0, 1), out=out, where=mask[..., None, None])
715 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(result == out).all()  # True

%%timeit
result = np.zeros((100, 100))
for i in range(300):
    for j in range(300):
        if mask[i, j]:
            result = result + data[i, j]
442 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Either of these methods has a huge advantage over what you are proposing, in that they do not use a large amount of memory. Your three steps boil down to

data[mask].sum(0)

You can skip step 1, since NumPy allows indexing directly with a boolean mask. A boolean mask can only produce output with a single flattened leading axis (the rows may contain different numbers of elements), hence the single sum. The issue is that data[mask] simply kills my interpreter because the temporary array is too large.
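
A small toy example of that flattening behavior (shapes shrunk purely for illustration):

small = np.arange(2 * 3 * 4 * 4, dtype=float).reshape(2, 3, 4, 4)
m = np.array([[True, False, True],
              [False, True, False]])

small[m].shape          # (3, 4, 4): one block per True entry on a single leading axis
small[m].sum(axis=0)    # shape (4, 4), the masked sum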

2 Comments

"As it happens, your loop is not so slow: it runs ~2x faster than the vectorized operation.." Indeed, I verified this at my end. I had this impression that numpy inbuilt functions do things faster. So, for the time being, I'm going to use the loopy version and will also try to figure out how to use numba to further speed up the computations as user @Ali_Sh has suggested.
@vyi You should select Ali's excellent answer.
