
Suppose I have a NumPy array with shape (50, 10000, 10000) containing 1000 distinct "clusters". For example, there would be a small volume somewhere containing only 1s, another small volume containing only 2s, etc. I would like to iterate through each cluster to create a mask like so:

for i in np.unique(arr)[1:]:
    mask = arr == i
    #do other stuff with mask

Creating each mask takes about 15 seconds, and iterating through 1000 clusters would take more than 4 hours. Is there a possible way to speed up the code or is this the best there is since there is no avoiding iterating through each element of the array?

EDIT: the dtype of the array is uint16

  • You have a 5GB numpy array? Commented Nov 10, 2017 at 5:47
  • How do you plan to use those masks? Because pre-computing all masks would result in a huge mask array, so don't think that's a good idea. Commented Nov 10, 2017 at 5:47
  • @cᴏʟᴅsᴘᴇᴇᴅ Sorry. Should mention that the dtype is uint16, so not really 5GB. Commented Nov 10, 2017 at 5:50
  • @Divakar the mask is used to calculate volumetric centroid of each cluster. Commented Nov 10, 2017 at 5:52
  • Stick to the loopy version for such a huge data. Commented Nov 10, 2017 at 5:52

1 Answer


I'm assuming arr is sparse:

  • you say the clusters are small, and 1000 clusters isn't going to tile an array that big
  • you iterate over np.unique(arr)[1:], so I assume the first unique value is 0

In this case I would recommend leveraging a scipy.sparse.csr_matrix:

from scipy.sparse import csr_matrix
sp_arr = csr_matrix(arr.reshape(1,-1))

This turns your big dense array into a one-row compressed sparse row array. Since sparse arrays don't like more than 2 dimensions, this tricks it into using ravelled indices. Now sp_arr has data (the cluster labels), indices (the ravelled indices), and indptr (which is trivial here since we only have one row). So,
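To make the `data`/`indices` layout concrete, here is a minimal sketch on a toy array (a small stand-in for the (50, 10000, 10000) volume):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy labelled volume standing in for the big array
arr = np.zeros((2, 3, 3), dtype=np.uint16)
arr[0, 0, 0:2] = 1   # cluster 1
arr[1, 2, 1:3] = 2   # cluster 2

sp_arr = csr_matrix(arr.reshape(1, -1))
print(sp_arr.data)     # cluster labels of the nonzero entries: [1 1 2 2]
print(sp_arr.indices)  # ravelled positions of those entries:   [ 0  1 16 17]
```

Only the nonzero entries are stored, so for a sparse uint16 volume this is dramatically smaller than the dense array.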

for i in np.unique(sp_arr.data):  # as a bonus this `unique` call should be faster too
    x, y, z = np.unravel_index(sp_arr.indices[sp_arr.data == i], arr.shape)

This should give, much more efficiently, coordinates equivalent to

for i in np.unique(arr)[1:]:
    x, y, z = np.nonzero(arr == i)

where x, y, z are the indices of the True values in mask. From there you can either reconstruct mask or work off the indices (recommended).
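Since a comment above mentions the masks are used to compute volumetric centroids, here is a sketch of doing that directly from the indices, skipping the dense mask entirely (the toy array here is my own illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

arr = np.zeros((4, 4, 4), dtype=np.uint16)
arr[1:3, 1:3, 1:3] = 5  # one cube-shaped cluster labelled 5

sp_arr = csr_matrix(arr.reshape(1, -1))
centroids = {}
for i in np.unique(sp_arr.data):
    # Recover 3D coordinates of cluster i from the ravelled sparse indices
    x, y, z = np.unravel_index(sp_arr.indices[sp_arr.data == i], arr.shape)
    centroids[i] = (x.mean(), y.mean(), z.mean())

print(centroids[5])  # centre of the 2x2x2 cube: (1.5, 1.5, 1.5)
```

The per-cluster cost is then proportional to the number of nonzero voxels, not the full 5-billion-element array.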

You could also do this purely with numpy, and still have a boolean mask at the end, though it is a bit less memory efficient:

all_mask = arr != 0    # points assigned to any cluster
data = arr[all_mask]   # all cluster labels
for i in np.unique(data):
    mask = all_mask.copy()
    mask[mask] = data == i  # now mask is same as before
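As a sanity check, this masked-assignment trick reproduces the direct `arr == i` comparison; a minimal verification on a toy array (my own example):

```python
import numpy as np

# Small stand-in for the (50, 10000, 10000) volume
arr = np.zeros((3, 3, 3), dtype=np.uint16)
arr[0, 0, 0] = 1
arr[2, 2, 1:3] = 2

all_mask = arr != 0    # points assigned to any cluster
data = arr[all_mask]   # labels of those points, in C order
masks = {}
for i in np.unique(data):
    mask = all_mask.copy()
    mask[mask] = data == i   # keep only the points labelled i
    masks[i] = mask
```

The win is that `data == i` compares only the nonzero entries, instead of all 5 billion elements as `arr == i` does.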

2 Comments

Wow. Thank you for the answer. Working with indices is much faster than recreating masks. Are there some resources out there where I can learn more about leveraging numpy/scipy?
That resource is pretty much the online documentation, and StackOverflow. Try searching questions, or just browse the post history of Divakar, MSiefert, COLDSPEED, etc.
