I have an example array that looks like array = np.array([[1,1,0,1], [0,1,0,0], [1,1,1,0], [0,0,1,2], [0,1,3,2], [1,1,0,1], [0,1,0,0]]) ...
array([[1, 1, 0, 1],
       [0, 1, 0, 0],
       [1, 1, 1, 0],
       [0, 0, 1, 2],
       [0, 1, 3, 2],
       [1, 1, 0, 1],
       [0, 1, 0, 0]])
With this in mind, I want to reformat this array into subarrays based on the first two columns. Using "How to split a numpy array based on a column?" as a reference, I turned this array into a list of arrays with ...
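import numpy as np
import pandas as pd
df = pd.DataFrame(array)
# Build a combined key column from the first two columns (as strings, then back to int)
df['4'] = df[0].astype(str) + df[1].astype(str)
df['4'] = df['4'].astype(int)
arr = df.to_numpy()
# One full-array boolean mask per unique key value
y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])]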
where y is ...
[array([[0, 0, 1, 2, 0]]),
 array([[0, 1, 0, 0, 1],
        [0, 1, 3, 2, 1],
        [0, 1, 0, 0, 1]]),
 array([[ 1,  1,  0,  1, 11],
        [ 1,  1,  1,  0, 11],
        [ 1,  1,  0,  1, 11]])]
This works fine, but the list comprehension that builds y takes far too long to run. The run time grows much faster than linearly as rows are added, because every unique key triggers another full scan of the array. I am working with hundreds of millions of rows, and y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])] is not practical from a time standpoint.
Any ideas on how to speed this up?
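One direction that might help, as a rough sketch rather than a tested solution at that scale: sort the rows once by the two key columns and cut the sorted array wherever the key changes, so the data is scanned a fixed number of times instead of once per unique key. The function name split_by_first_two_columns is made up for illustration, and this version groups on the two columns directly rather than on the concatenated key in column 4 (which would also become ambiguous for multi-digit values, e.g. 1 and 11 versus 11 and 1).

```python
import numpy as np

array = np.array([[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 2],
                  [0, 1, 3, 2], [1, 1, 0, 1], [0, 1, 0, 0]])

def split_by_first_two_columns(a):
    """Hypothetical helper: group rows of `a` by the values in columns 0 and 1."""
    # np.lexsort treats the LAST key as primary, so this sorts by column 0,
    # then by column 1; the sort is stable, so row order within a group is kept.
    order = np.lexsort((a[:, 1], a[:, 0]))
    s = a[order]
    # A new group starts wherever either key column differs from the previous row.
    changed = np.any(s[1:, :2] != s[:-1, :2], axis=1)
    boundaries = np.flatnonzero(changed) + 1
    # np.split returns views into the sorted copy, one per group.
    return np.split(s, boundaries)

groups = split_by_first_two_columns(array)
```

For the example array this gives three groups with 1, 3 and 3 rows, matching y above minus the extra key column. The sort is the dominant cost, and a[order] makes one full copy of the data, which may matter for memory at hundreds of millions of rows.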