
I have an example array that looks like array = np.array([[1,1,0,1], [0,1,0,0], [1,1,1,0], [0,0,1,2], [0,1,3,2], [1,1,0,1], [0,1,0,0]]) ...

array([[1, 1, 0, 1],
       [0, 1, 0, 0],
       [1, 1, 1, 0],
       [0, 0, 1, 2],
       [0, 1, 3, 2],
       [1, 1, 0, 1],
       [0, 1, 0, 0]])

With this in mind, I want to reformat this array into subarrays based on the first two columns. Using "How to split a numpy array based on a column?" as a reference, I turned this array into a list of arrays with ...

import numpy as np
import pandas as pd

df = pd.DataFrame(array)
# build a composite key from the first two columns, e.g. (1, 1) -> "11" -> 11
df['4'] = (df[0].astype(str) + df[1].astype(str)).astype(int)
arr = df.to_numpy()
y = [arr[arr[:, 4] == k] for k in np.unique(arr[:, 4])]

where y is ...

[array([[0, 0, 1, 2, 0]]),
 array([[0, 1, 0, 0, 1],
        [0, 1, 3, 2, 1],
        [0, 1, 0, 0, 1]]),
 array([[ 1,  1,  0,  1, 11],
        [ 1,  1,  1,  0, 11],
        [ 1,  1,  0,  1, 11]])]

This works fine, but it takes far too long to build y. The comprehension scans the whole array once per unique key, so the runtime grows with (number of rows) × (number of unique keys) rather than linearly. I am playing around with hundreds of millions of rows, and y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])] is not practical from a time standpoint.

Any ideas on how to speed this up?

  • Are the first two columns always full of 1s and 0s? Commented Mar 15, 2021 at 20:18
  • Try itertools.groupby(). It just returns an iterator, and you can put it in a container when you want; a minimal sketch follows these comments. Commented Mar 15, 2021 at 20:20
  • @PabloC, no, lots of different values. In my actual dataset I take a factorized version of four columns; this is just a simplified version. Commented Mar 15, 2021 at 20:20
  • Your method will be slow because you construct a DataFrame, do a lot of type conversions, and then scan again to get the unique keys. Commented Mar 15, 2021 at 20:22
  • @the23Effect even if I convert it back to an array? Commented Mar 15, 2021 at 20:23
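
A minimal sketch of the itertools.groupby() suggestion, using the sample array from the question. Note that groupby only merges consecutive equal keys, so the rows must be sorted first, which makes this O(n log n) with the sort running in Python space:

from itertools import groupby
import numpy as np

array = np.array([[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0],
                  [0, 0, 1, 2], [0, 1, 3, 2], [1, 1, 0, 1], [0, 1, 0, 0]])

key = lambda row: (row[0], row[1])        # group on the first two columns
rows = sorted(array.tolist(), key=key)    # groupby needs sorted input
y = [np.array(list(g)) for _, g in groupby(rows, key=key)]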

3 Answers


What about using the numpy_indexed library:

import numpy as np
import numpy_indexed as npi

a = np.array([[1, 1, 0, 1],
              [0, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 2],
              [0, 1, 3, 2],
              [1, 1, 0, 1],
              [0, 1, 0, 0]])

# encode the first two columns as one integer key: col0 + 10*col1
key = np.dot(a[:, :2], [1, 10])
y = npi.group_by(key).split_array_as_list(a)

Output

y
[array([[0, 0, 1, 2]]), 
 array([[0, 1, 0, 0],
        [0, 1, 3, 2],
        [0, 1, 0, 0]]),
 array([[ 1,  1,  0,  1],
        [ 1,  1,  1,  0],
        [ 1,  1,  0,  1]])]

You can easily install the library with:

> pip install numpy-indexed

1 Comment

I suggest using np.unique(.., return_inverse=True) to get the key since it does not assume anything about the data. The above only works for single-digit integers.
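
A minimal sketch of that suggestion, reusing a and npi from the answer above:

_, key = np.unique(a[:, :2], axis=0, return_inverse=True)
y = npi.group_by(key).split_array_as_list(a)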

Let me know if this performs better; it does the grouping in a single pass over the rows:

from collections import defaultdict
import numpy as np

# arr: the input array (np.ndarray)
outgen = defaultdict(list)

# single pass: bucket each row under the tuple of its first two columns
for row in arr:
    outgen[(row[0], row[1])].append(row)

# outgen: the required output (list of np.ndarray), in first-appearance order
outgen = [np.array(rows) for rows in outgen.values()]
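
As a quick check on the sample array from the question (groups come out in first-appearance order, so the exact list order differs from the np.unique-based answers):

arr = np.array([[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0],
                [0, 0, 1, 2], [0, 1, 3, 2], [1, 1, 0, 1], [0, 1, 0, 0]])
# after running the loop above:
# outgen[0] -> the three rows starting with (1, 1)
# outgen[1] -> the three rows starting with (0, 1)
# outgen[2] -> the single row starting with (0, 0)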



You can use np.unique directly here.

unique, indexer = np.unique(arr[:, :2], axis=0, return_inverse=True)
# indexer maps each row to its group number 0..len(unique)-1
groups = {i: arr[indexer == i, :] for i in range(len(unique))}

This is probably about as good as it gets for your desired output. However, instead of splitting it into a list of subarrays, you could sort by the unique key and then work with slices. This can help when there are many unique values, which would otherwise produce a very long list.

arr[:] = arr[np.argsort(indexer, kind='stable'), :]    # a stable sort is guaranteed to preserve the order within each group
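
A self-contained sketch of the slice-based variant, assuming unique and indexer from the np.unique call above:

order = np.argsort(indexer, kind='stable')   # stable: keeps row order within groups
sorted_arr = arr[order]
sorted_keys = indexer[order]
# starts[i] is the first row of group i in sorted_arr
starts = np.searchsorted(sorted_keys, np.arange(len(unique)))
groups = np.split(sorted_arr, starts[1:])    # contiguous views, no per-key scan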

EDIT:

Here is a powerful solution which I have been using for a sort of 2-D factorization. It takes 8 ms for 1 million rows of single-digit integers (vs. >100 ms for np.unique).

import pandas as pd
from pandas.core.sorting import get_group_index  # pandas internal, may change between versions

columns = x[:, 0], x[:, 1]
factored = map(pd.factorize, columns)
codes, unique_values = map(list, zip(*factored))
group_index = get_group_index(codes, map(len, unique_values), sort=False, xnull=False)

It uses the internal algorithm of DataFrame.drop_duplicates. Note that the ordering of the keys is not the sort order of the unique tuples.
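
To recover the subarrays from group_index, the same stable-sort-and-split trick applies (a sketch, assuming x and group_index from the snippet above):

import numpy as np

order = np.argsort(group_index, kind='stable')
# group boundaries fall where the sorted key changes value
boundaries = np.flatnonzero(np.diff(group_index[order])) + 1
y = np.split(x[order], boundaries)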

There is also a new open-source library, riptable, which emulates numpy and pandas in some ways but can be a lot more powerful. The creation of the key takes around 4 ms:

import riptable as rt

columns = [x[:, 0], x[:, 1]]
unique_values, key = rt.unique(columns, return_inverse=True)

Here, unique_values is a tuple containing two arrays, which can be zipped to get the unique tuples.
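
For instance, assuming unique_values has the shape described:

unique_tuples = list(zip(*unique_values))   # one (col0, col1) tuple per group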

4 Comments

Sorting will not be O(n) in this scenario, right? Whereas hashing will be O(n + k), isn't it?
But this looks a lot neater.
Correct. The sorting idea was just an addendum which might help with memory if that's an issue. The array split he is looking for is achieved in the first two lines (I used a dict for ease of reference, but a list is just as good since the keys are integers from 0 to N-1). I can't see it becoming more succinct without assumptions about the data, e.g. that they are all single-digit integers as in @Pablo C's solution.
This solution is actually disappointingly slow. Pandas appears to have more powerful tools for this. Editing my answer with a routine I have been using for years.
