Working with structured object arrays in NumPy

Question

Say I have an array of (x, y) points with the following structure:

arr = np.array([([1.     ], [2.     ]),
                ([1., 93.], [5., 46.]),
                ([4.     ], [3.     ])],
               dtype=[('x','O'), ('y', 'O')])

i.e. the points are grouped into these innermost arrays. The size of an innermost array may be arbitrary, but it is always the same for x and y.

I want to be able to perform two things:

a) Expand the innermost arrays by concatenating their content, so for the above example the result looks like:

np.array([( 1.,  2.),
          ( 1.,  5.),
          (93., 46.),
          ( 4.,  3.)],
         dtype=[('x','f8'), ('y','f8')])

b) For each (outermost) entry, select the element with, say, the largest y:

np.array([( 1.,  2.),
          (93., 46.),
          ( 4.,  3.)],
         dtype=[('x','f8'), ('y','f8')])

I believe there should be a way of doing this efficiently without using ugly for loops. Would appreciate any help.

UPD (a and b using ugly loops):

(arr is the array defined in the beginning of the post)

a)

np.array([(x_, y_) for x, y in arr for x_, y_ in zip(x, y)], dtype=[('x','f8'), ('y','f8')])

b)

np.array([(x[np.argmax(np.array(y))], y[np.argmax(np.array(y))]) for x, y in arr],dtype=[('x','f8'), ('y','f8')])

Another problem is that in reality I have not just two fields (x and y) but 77 fields of various types (floats, integers, booleans), so these expressions will grow to many lines.

Once you place Python lists into a NumPy array (resulting in an object dtype), you are forced to use Python loops to iterate over the items in the lists.
– unutbu, Commented Jul 24, 2017 at 12:23
@unutbu would it make a difference if they are not Python lists but NumPy arrays themselves (still stored under object dtype)?
– SiLiKhon, Commented Jul 24, 2017 at 12:48
@SiLiKhon: For best performance with NumPy you want your data in a single big array of native dtype (not object). Only when the data is in one contiguous block of memory can NumPy leverage the dtype and shape of the data to perform fast vectorized operations. When you have a NumPy array of object dtype holding NumPy arrays, each subarray is in its own, possibly discontiguous, block of memory. NumPy can no longer perform any fast vectorized operation over the subarrays; anything you do across them is essentially a Python loop.
– unutbu, Commented Jul 24, 2017 at 14:52
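
To illustrate the point about a single array of native dtype, here is a minimal NumPy-only sketch. It assumes every row holds at least one point and that all fields of a row have the same length, as stated in the question. The names `lengths`, `columns`, `flat`, `group` and `best` are made up for this example. The only Python-level pass is the flattening step; after that, both (a) and (b) are vectorized.

```python
import numpy as np

arr = np.array([([1.     ], [2.     ]),
                ([1., 93.], [5., 46.]),
                ([4.     ], [3.     ])],
               dtype=[('x', 'O'), ('y', 'O')])

# Pay the Python-level cost once: record each row's length and
# concatenate every field into one flat array of native dtype.
lengths = np.array([len(v) for v in arr['x']])
columns = {name: np.concatenate(list(arr[name])) for name in arr.dtype.names}

# a) flat structured array; field dtypes are taken from the concatenated data
flat = np.empty(lengths.sum(), dtype=[(n, c.dtype) for n, c in columns.items()])
for n, c in columns.items():
    flat[n] = c

# Label every flat element with the index of the row it came from.
group = np.repeat(np.arange(len(arr)), lengths)

# b) sort by group, then by y within each group; the last element of each
#    group then has the largest y (assumes no row is empty).
order = np.lexsort((flat['y'], group))
best = flat[order[np.cumsum(lengths) - 1]]

print(flat)
print(best)
```

Because the loop runs over `arr.dtype.names`, the same code should carry over to many fields without growing, provided each field concatenates to a sensible native dtype.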

unutbu · Accepted Answer · 2017-07-24 18:19:31Z

Using Pandas, you could store your data in a flat DataFrame, with a group column indicating which row of the original array each point came from:

import numpy as np
import pandas as pd
df = pd.DataFrame([
    (0, 1, 2),
    (1, 1, 5),
    (1, 93, 46),
    (2, 4, 3)], dtype='f8', columns=['group', 'x', 'y'])
print(df)
#    group     x     y
# 0    0.0   1.0   2.0
# 1    1.0   1.0   5.0
# 2    1.0  93.0  46.0
# 3    2.0   4.0   3.0

Then the first operation is merely a slice of the x and y columns:

print(df[['x','y']])
#       x     y
# 0   1.0   2.0
# 1   1.0   5.0
# 2  93.0  46.0
# 3   4.0   3.0

and the second operation can be done using groupby/idxmax:

print(df.loc[df.groupby('group')['y'].idxmax(), ['x', 'y']])
#       x     y
# 0   1.0   2.0
# 2  93.0  46.0
# 3   4.0   3.0

Given the structured NumPy array, arr, you're going to have to loop through the lists at least once to perform any of these operations. So you might as well pay the price once to organize the data in a better data structure, such as a Pandas DataFrame.

Here is one way you could convert arr to df:

import numpy as np
import pandas as pd

arr = np.array([([1.     ], [2.     ]),
                ([1., 93.], [5., 46.]),
                ([4.     ], [3.     ])],
               dtype=[('x','O'), ('y', 'O')])

df = pd.DataFrame(arr)
df = (pd.concat({col: df[col].apply(pd.Series).stack() for col in df}, axis=1)
      .reset_index(drop=True))
print(df)

yields

      x     y
0   1.0   2.0
1   1.0   5.0
2  93.0  46.0
3   4.0   3.0
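
Note that the conversion above does not keep the group column that the groupby/idxmax step relies on. A possible variant that keeps it is sketched below. It is only a sketch: it assumes pandas 1.3 or newer (for multi-column `explode`) and that every field holds floats, so with fields of mixed types the `astype` call would need per-column dtypes. The names `raw` and `best` are made up for this example.

```python
import numpy as np
import pandas as pd

arr = np.array([([1.     ], [2.     ]),
                ([1., 93.], [5., 46.]),
                ([4.     ], [3.     ])],
               dtype=[('x', 'O'), ('y', 'O')])

raw = pd.DataFrame(arr)                  # object columns holding lists
df = (raw.explode(list(raw.columns))     # one row per point; index repeats the source row
         .astype('f8')                   # assumption: all fields are floats
         .rename_axis('group')           # the repeated index is the group label
         .reset_index())
print(df)

# b) the row with the largest y in each group
best = df.loc[df.groupby('group')['y'].idxmax(), ['x', 'y']]
print(best)
```

Here the repeated index that `explode` leaves behind serves as the group label, so no separate bookkeeping of row lengths is needed.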

edited Jul 24, 2017 at 18:19

answered Jul 24, 2017 at 15:05


SiLiKhon Over a year ago

Nice way, thanks! The problem is that I get the data from outside and they're in the format I described in OP, which is a pain...
