Efficiently find index of DataFrame values in array

Question

I have a DataFrame that resembles:

x     y     z
--------------
0     A     10
0     D     13
1     X     20
...

and I have two sorted arrays for every possible value for x and y:

x_values = [0, 1, ...]
y_values = ['a', ..., 'A', ..., 'D', ..., 'X', ...]

so I wrote a function:

def lookup(record, lookup_list, lookup_attr):
    return np.searchsorted(lookup_list, getattr(record, lookup_attr))

and then call:

df_x_indicies = df.apply(lambda r: lookup(r, x_values, 'x'), axis=1)
df_y_indicies = df.apply(lambda r: lookup(r, y_values, 'y'), axis=1)

# df_x_indicies: [0, 0, 1, ...]
# df_y_indicies: [26, ...]

Is there a more performant way to do this, and possibly to handle multiple columns at once so that a DataFrame is returned rather than a Series?

I tried:

np.where(np.in1d(x_values, df.x))[0]

but this removes duplicate values and that is not desired.

cs95 · Accepted Answer · 2018-12-18 15:19:17Z

4

You can convert your index arrays to pd.Index objects to make lookup fast(er).

u, v = map(pd.Index, [x_values, y_values])
pd.DataFrame({'x': u.get_indexer(df.x), 'y': v.get_indexer(df.y)})

   x  y
0  0  1
1  0  2
2  1  3

Where,

x_values
# [0, 1]

y_values
# ['a', 'A', 'D', 'X']
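
For reference, here is a self-contained sketch of the same idea. The DataFrame contents are an assumption reconstructed from the rows shown in the question, and the names `result` and `missing` are illustrative only. It also shows that `get_indexer` reports -1 for values that are not in the index, which is worth checking for if the lookup arrays might be incomplete.

```python
import pandas as pd

# Toy data mirroring the question (assumed, for illustration only)
df = pd.DataFrame({'x': [0, 0, 1], 'y': ['A', 'D', 'X'], 'z': [10, 13, 20]})
x_values = [0, 1]
y_values = ['a', 'A', 'D', 'X']

# Wrap each lookup array in a pd.Index (the arrays must hold unique values)
u, v = map(pd.Index, [x_values, y_values])

# get_indexer returns the position of each value in the Index,
# one entry per row, so duplicates in the column are preserved
result = pd.DataFrame({'x': u.get_indexer(df.x), 'y': v.get_indexer(df.y)})
print(result)

# Values that do not occur in the Index come back as -1
missing = v.get_indexer(['A', 'Q'])
print(missing)
```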

As to your requirement of having this work for multiple columns, you will have to iterate over each one. Here's a version of the code above that should generalise to N columns and indices.

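val_list = [x_values, y_values]  # [x_values, y_values, z_values, ...]
idx_list = map(pd.Index, val_list)
# Iterating over df yields its column names, so zip pairs each index
# with a column in order; val_list must follow the column order of df
pd.DataFrame({
    c: idx.get_indexer(df[c]) for idx, c in zip(idx_list, df)})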

   x  y
0  0  1
1  0  2
2  1  3
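
If relying on column order feels fragile, one possible variation is to key the lookup arrays by column name. This is only a sketch: the dict name `lookups` and the toy data are assumptions, not part of the original code.

```python
import pandas as pd

# Toy data mirroring the question (assumed, for illustration only)
df = pd.DataFrame({'x': [0, 0, 1], 'y': ['A', 'D', 'X'], 'z': [10, 13, 20]})

# Hypothetical mapping: column name -> array of all possible values
lookups = {
    'x': [0, 1],
    'y': ['a', 'A', 'D', 'X'],
}

# Build one pd.Index per column and look up every row's position in it;
# columns without an entry in `lookups` (here 'z') are simply skipped
indexed = pd.DataFrame(
    {col: pd.Index(vals).get_indexer(df[col]) for col, vals in lookups.items()},
    index=df.index,
)
print(indexed)
```

Passing `index=df.index` keeps the result aligned with the original rows, which matters if `df` does not have a default integer index.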

edited Dec 18, 2018 at 15:19

answered Dec 18, 2018 at 15:14

cs95


3 Comments

SumNeuron Over a year ago

That is clever! and also something I have not seen in the previous S.O. posts I read

cs95 Over a year ago

@SumNeuron If it helps, please consider accepting the answer. Thanks.

SumNeuron Over a year ago

will do, but I have been advised in the past not to immediately accept answers, but to give some time for others in the community to chime in other ideas as well :)

BENY · Answer · 2018-12-18 15:23:00Z

2

Update: use a Series with .loc; you can also try reindex.

pd.Series(range(len(x_values)),index=x_values).loc[df.x].tolist()
Out[33]: [0, 0, 1]
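
A runnable sketch of both variants, assuming the toy data from the question (the names `positions`, `via_loc` and `via_reindex` are illustrative). As far as I know, the practical difference is that recent pandas versions raise a KeyError from .loc when a label is missing, whereas reindex fills missing labels with NaN.

```python
import pandas as pd

# Toy data mirroring the question (assumed, for illustration only)
df = pd.DataFrame({'x': [0, 0, 1], 'y': ['A', 'D', 'X'], 'z': [10, 13, 20]})
x_values = [0, 1]

# Series that maps each possible value to its position in x_values
positions = pd.Series(range(len(x_values)), index=x_values)

# Label-based lookup: one position per row, duplicates preserved
via_loc = positions.loc[df.x].tolist()

# Same lookup through reindex
via_reindex = positions.reindex(df.x).tolist()

print(via_loc, via_reindex)
```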

edited Dec 18, 2018 at 15:23

answered Dec 18, 2018 at 15:11

BENY

1 Comment

SumNeuron Over a year ago

I think I might not have been clear about which indices I am trying to retrieve. I would like the indices for columns x and y based on the positions at which their values occur in x_values and y_values, so I would expect the returned indices for x (based on this trivial example) to be [0, 0, 1, ...]
