
Suppose I have a NumPy array with shape (50, 10000, 10000) containing 1000 distinct "clusters". For example, there would be a small volume somewhere containing only 1s, another small volume containing only 2s, etc. I would like to iterate through each cluster to create a mask like so:

for i in np.unique(arr)[1:]:
    mask = arr == i
    #do other stuff with mask

Creating each mask takes about 15 seconds, and iterating through 1000 clusters would take more than 4 hours. Is there a possible way to speed up the code or is this the best there is since there is no avoiding iterating through each element of the array?

EDIT: the dtype of the array is uint16

  • You have a 5GB numpy array? Commented Nov 10, 2017 at 5:47
  • How do you plan to use those masks? Because pre-computing all masks would result in a huge mask array, so don't think that's a good idea. Commented Nov 10, 2017 at 5:47
  • @cᴏʟᴅsᴘᴇᴇᴅ Sorry. Should mention that the dtype is uint16, so not really 5GB. Commented Nov 10, 2017 at 5:50
  • @Divakar the mask is used to calculate volumetric centroid of each cluster. Commented Nov 10, 2017 at 5:52
  • Stick to the loopy version for such a huge data. Commented Nov 10, 2017 at 5:52

1 Answer


I'm assuming arr is sparse:

  • you say the clusters are small, and 1000 clusters isn't going to tile an array that big
  • you iterate over np.unique(arr)[1:], so I assume the first unique value is 0

In this case I would recommend leveraging a scipy.sparse.csr_matrix:

from scipy.sparse import csr_matrix
sp_arr = csr_matrix(arr.reshape(1,-1))

This turns your big dense array into a one-row compressed sparse row array. Since sparse arrays don't like more than 2 dimensions, this tricks it into using ravelled indices. Now sp_arr has data (the cluster labels), indices (the ravelled indices), and indptr (which is trivial here since we only have one row). So,
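To make the `data`/`indices` layout concrete, here is a minimal sketch on a toy array (a small stand-in for the (50, 10000, 10000) volume):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy labelled volume standing in for the big array
arr = np.zeros((2, 3, 3), dtype=np.uint16)
arr[0, 0, 0:2] = 1   # cluster 1
arr[1, 2, 1:3] = 2   # cluster 2

sp_arr = csr_matrix(arr.reshape(1, -1))
print(sp_arr.data)     # cluster labels of the nonzero entries: [1 1 2 2]
print(sp_arr.indices)  # ravelled positions of those entries:   [ 0  1 16 17]
```

Only the nonzero entries are stored, so for a sparse uint16 volume this is dramatically smaller than the dense array.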

for i in np.unique(sp_arr.data):  # as a bonus this `unique` call should be faster too
    x, y, z = np.unravel_index(sp_arr.indices[sp_arr.data == i], arr.shape)

This should give, much more efficiently, coordinates equivalent to

for i in np.unique(arr)[1:]:
    x, y, z = np.nonzero(arr == i)

where x, y, z are the indices of the True values in mask. From there you can either reconstruct mask or work off the indices (recommended).
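Since a comment above mentions the masks are used to compute volumetric centroids, here is a sketch of doing that directly from the indices, skipping the dense mask entirely (the toy array here is my own illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

arr = np.zeros((4, 4, 4), dtype=np.uint16)
arr[1:3, 1:3, 1:3] = 5  # one cube-shaped cluster labelled 5

sp_arr = csr_matrix(arr.reshape(1, -1))
centroids = {}
for i in np.unique(sp_arr.data):
    # Recover 3D coordinates of cluster i from the ravelled sparse indices
    x, y, z = np.unravel_index(sp_arr.indices[sp_arr.data == i], arr.shape)
    centroids[i] = (x.mean(), y.mean(), z.mean())

print(centroids[5])  # centre of the 2x2x2 cube: (1.5, 1.5, 1.5)
```

The per-cluster cost is then proportional to the number of nonzero voxels, not the full 5-billion-element array.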

You could also do this purely with numpy, and still have a boolean mask at the end, though it is a bit less memory efficient:

all_mask = arr != 0    # points assigned to any cluster
data = arr[all_mask]   # all cluster labels
for i in np.unique(data):
    mask = all_mask.copy()
    mask[mask] = data == i  # now mask is same as before
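As a sanity check, this masked-assignment trick reproduces the direct `arr == i` comparison; a minimal verification on a toy array (my own example):

```python
import numpy as np

# Small stand-in for the (50, 10000, 10000) volume
arr = np.zeros((3, 3, 3), dtype=np.uint16)
arr[0, 0, 0] = 1
arr[2, 2, 1:3] = 2

all_mask = arr != 0    # points assigned to any cluster
data = arr[all_mask]   # labels of those points, in C order
masks = {}
for i in np.unique(data):
    mask = all_mask.copy()
    mask[mask] = data == i   # keep only the points labelled i
    masks[i] = mask
```

The win is that `data == i` compares only the nonzero entries, instead of all 5 billion elements as `arr == i` does.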

2 Comments

Wow. Thank you for the answer. Working with indices is much faster than recreating masks. Are there some resources out there where I can learn more about leveraging numpy/scipy?
That resource is pretty much the online documentation, and StackOverflow. Try searching questions, or just browse the post history of Divakar, MSiefert, COLDSPEED, etc.
