
I have an example array that looks like array = np.array([[1,1,0,1], [0,1,0,0], [1,1,1,0], [0,0,1,2], [0,1,3,2], [1,1,0,1], [0,1,0,0]]) ...

array([[1, 1, 0, 1],
       [0, 1, 0, 0],
       [1, 1, 1, 0],
       [0, 0, 1, 2],
       [0, 1, 3, 2],
       [1, 1, 0, 1],
       [0, 1, 0, 0]])

With this in mind, I want to reformat this array into subarrays based on the first two columns. Using "How to split a numpy array based on a column?" as a reference, I turned this array into a list of arrays with ...

import numpy as np
import pandas as pd

df = pd.DataFrame(array)
# build a composite key from the first two columns, e.g. (1, 1) -> "11" -> 11
df['4'] = (df[0].astype(str) + df[1].astype(str)).astype(int)
arr = df.to_numpy()
y = [arr[arr[:, 4] == k] for k in np.unique(arr[:, 4])]

where y is ...

[array([[0, 0, 1, 2, 0]]),
 array([[0, 1, 0, 0, 1],
        [0, 1, 3, 2, 1],
        [0, 1, 0, 0, 1]]),
 array([[ 1,  1,  0,  1, 11],
        [ 1,  1,  1,  0, 11],
        [ 1,  1,  0,  1, 11]])]

This works fine, but it takes far too long to build y. The comprehension scans the whole array once per unique key, so the runtime grows with (number of rows) × (number of unique keys) rather than linearly. I am playing around with hundreds of millions of rows, and y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])] is not practical from a time standpoint.

Any ideas on how to speed this up?

  • Are the first two columns always full of 1s and 0s? Commented Mar 15, 2021 at 20:18
  • Try itertools.groupby(). It just returns an iterator, and you can put it in a container when you want; a minimal sketch follows these comments. Commented Mar 15, 2021 at 20:20
  • @PabloC, no, lots of different values. In my actual dataset I take a factorized version of four columns; this is just a simplified version. Commented Mar 15, 2021 at 20:20
  • Your method will be slow because you construct a DataFrame, do a lot of type conversions, and then scan again to get the unique keys. Commented Mar 15, 2021 at 20:22
  • @the23Effect even if I convert it back to an array? Commented Mar 15, 2021 at 20:23
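
A minimal sketch of the itertools.groupby() suggestion, using the sample array from the question. Note that groupby only merges consecutive equal keys, so the rows must be sorted first, which makes this O(n log n) with the sort running in Python space:

from itertools import groupby
import numpy as np

array = np.array([[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0],
                  [0, 0, 1, 2], [0, 1, 3, 2], [1, 1, 0, 1], [0, 1, 0, 0]])

key = lambda row: (row[0], row[1])        # group on the first two columns
rows = sorted(array.tolist(), key=key)    # groupby needs sorted input
y = [np.array(list(g)) for _, g in groupby(rows, key=key)]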

3 Answers


What about using the numpy_indexed library:

import numpy as np
import numpy_indexed as npi

a = np.array([[1, 1, 0, 1],
              [0, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 2],
              [0, 1, 3, 2],
              [1, 1, 0, 1],
              [0, 1, 0, 0]])

# encode the first two columns as one integer key: col0 + 10*col1
key = np.dot(a[:, :2], [1, 10])
y = npi.group_by(key).split_array_as_list(a)

Output

y
[array([[0, 0, 1, 2]]), 
 array([[0, 1, 0, 0],
        [0, 1, 3, 2],
        [0, 1, 0, 0]]),
 array([[ 1,  1,  0,  1],
        [ 1,  1,  1,  0],
        [ 1,  1,  0,  1]])]

You can easily install the library with:

> pip install numpy-indexed

1 Comment

I suggest using np.unique(.., return_inverse=True) to get the key since it does not assume anything about the data. The above only works for single-digit integers.
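
A minimal sketch of that suggestion, reusing a and npi from the answer above:

_, key = np.unique(a[:, :2], axis=0, return_inverse=True)
y = npi.group_by(key).split_array_as_list(a)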

Let me know if this performs better; it does the grouping in a single pass over the rows:

from collections import defaultdict
import numpy as np

# arr: the input array (np.ndarray)
outgen = defaultdict(list)

# single pass: bucket each row under the tuple of its first two columns
for row in arr:
    outgen[(row[0], row[1])].append(row)

# outgen: the required output (list of np.ndarray), in first-appearance order
outgen = [np.array(rows) for rows in outgen.values()]
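
As a quick check on the sample array from the question (groups come out in first-appearance order, so the exact list order differs from the np.unique-based answers):

arr = np.array([[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0],
                [0, 0, 1, 2], [0, 1, 3, 2], [1, 1, 0, 1], [0, 1, 0, 0]])
# after running the loop above:
# outgen[0] -> the three rows starting with (1, 1)
# outgen[1] -> the three rows starting with (0, 1)
# outgen[2] -> the single row starting with (0, 0)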



You can use np.unique directly here.

unique, indexer = np.unique(arr[:, :2], axis=0, return_inverse=True)
# indexer maps each row to its group number 0..len(unique)-1
groups = {i: arr[indexer == i, :] for i in range(len(unique))}

This is probably about as good as it gets for your desired output. However, instead of splitting it into a list of subarrays, you could sort by the unique key and then work with slices. This can help when there are many unique values, which would otherwise produce a very long list.

arr[:] = arr[np.argsort(indexer, kind='stable'), :]    # a stable sort is guaranteed to preserve the order within each group
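
A self-contained sketch of the slice-based variant, assuming unique and indexer from the np.unique call above:

order = np.argsort(indexer, kind='stable')   # stable: keeps row order within groups
sorted_arr = arr[order]
sorted_keys = indexer[order]
# starts[i] is the first row of group i in sorted_arr
starts = np.searchsorted(sorted_keys, np.arange(len(unique)))
groups = np.split(sorted_arr, starts[1:])    # contiguous views, no per-key scan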

EDIT:

Here is a powerful solution which I have been using for a sort of 2-D factorization. It takes 8 ms for 1 million rows of single-digit integers (vs. >100 ms for np.unique).

import pandas as pd
from pandas.core.sorting import get_group_index  # pandas internal, may change between versions

columns = x[:, 0], x[:, 1]
factored = map(pd.factorize, columns)
codes, unique_values = map(list, zip(*factored))
group_index = get_group_index(codes, map(len, unique_values), sort=False, xnull=False)

It uses the internal algorithm of DataFrame.drop_duplicates. Note that the ordering of the keys is not the sort order of the unique tuples.
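
To recover the subarrays from group_index, the same stable-sort-and-split trick applies (a sketch, assuming x and group_index from the snippet above):

import numpy as np

order = np.argsort(group_index, kind='stable')
# group boundaries fall where the sorted key changes value
boundaries = np.flatnonzero(np.diff(group_index[order])) + 1
y = np.split(x[order], boundaries)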

There is also a new open-source library, riptable, which emulates numpy and pandas in some ways but can be a lot more powerful. The creation of the key takes around 4 ms:

import riptable as rt

columns = [x[:, 0], x[:, 1]]
unique_values, key = rt.unique(columns, return_inverse=True)

Here, unique_values is a tuple containing two arrays, which can be zipped to get the unique tuples.
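
For instance, assuming unique_values has the shape described:

unique_tuples = list(zip(*unique_values))   # one (col0, col1) tuple per group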

4 Comments

Sorting will not be O(n) in this scenario, right? Whereas hashing will be O(n + k), isn't it?
But this looks a lot neater.
Correct. The sorting idea was just an addendum which might help with memory if that's an issue. The array split he is looking for is achieved in the first two lines (I used a dict for ease of reference, but a list is just as good since the keys are integers from 0 to N-1). I can't see it becoming more succinct without assumptions about the data, e.g. that they are all single-digit integers as in @Pablo C's solution.
This solution is actually disappointingly slow. Pandas appears to have more powerful tools for this. Editing my answer with a routine I have been using for years.
