I have a DataFrame that resembles:

x     y     z
--------------
0     A     10
0     D     13
1     X     20
...

and I have two sorted arrays holding every possible value of x and y:

x_values = [0, 1, ...]
y_values = ['a', ..., 'A', ..., 'D', ..., 'X', ...]

so I wrote a function:

import numpy as np

def lookup(record, lookup_list, lookup_attr):
    # Position of the record's value within the sorted lookup_list
    return np.searchsorted(lookup_list, getattr(record, lookup_attr))

and then call:

df_x_indicies = df.apply(lambda r: lookup(r, x_values, 'x'), axis=1)
df_y_indicies = df.apply(lambda r: lookup(r, y_values, 'y'), axis=1)

# df_x_indicies: [0, 0, 1, ...]
# df_y_indicies: [26, ...]

But is there a more performant way to do this, and possibly to handle multiple columns at once so that a DataFrame is returned rather than a Series?

I tried:

np.where(np.in1d(x_values, df.x))[0]

but this removes duplicate values and that is not desired.

2 Answers

You can convert your index arrays to pd.Index objects to make lookup fast(er).

u, v = map(pd.Index, [x_values, y_values])
pd.DataFrame({'x': u.get_indexer(df.x), 'y': v.get_indexer(df.y)})

   x  y
0  0  1
1  0  2
2  1  3

Where,

x_values
# [0, 1]

y_values
# ['a', 'A', 'D', 'X']
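
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame and the shortened value lists are assumptions reconstructed from the question's example, not the asker's real data. It also shows two properties of get_indexer worth knowing: a value that is absent from the index comes back as -1, and the index itself needs to hold unique values.

```python
import pandas as pd

# Sample data assumed from the question's example.
df = pd.DataFrame({'x': [0, 0, 1], 'y': ['A', 'D', 'X'], 'z': [10, 13, 20]})
x_values = [0, 1]
y_values = ['a', 'A', 'D', 'X']

u, v = map(pd.Index, [x_values, y_values])

# One vectorised lookup per column instead of one Python call per row.
result = pd.DataFrame({'x': u.get_indexer(df['x']),
                       'y': v.get_indexer(df['y'])})
print(result)

# A value that does not occur in the index is reported as -1.
missing = v.get_indexer(['A', 'Q'])
print(missing)
```

Duplicate values in df.x are preserved in the result, which is what the np.in1d attempt in the question could not do.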

As to your requirement of having this work for multiple columns, you will have to iterate over each one. Here's a version of the code above that should generalise to N columns and indices.

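val_list = [x_values, y_values]  # extend with z_values, ... as needed
idx_list = map(pd.Index, val_list)
# zip pairs each index with a column name in df's column order and
# stops at the shorter input, so val_list must follow that order
pd.DataFrame({
    c: idx.get_indexer(df[c]) for idx, c in zip(idx_list, df)})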

   x  y
0  0  1
1  0  2
2  1  3
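
If relying on column order feels fragile, one possible variation is to pair each column name with its value list explicitly. This is only a sketch: positions_in and values_by_column are hypothetical names introduced here, and the sample data is assumed from the question.

```python
import pandas as pd

def positions_in(df, values_by_column):
    """Return one column of positions per key in values_by_column.

    Hypothetical helper for illustration; each value list must be unique.
    """
    return pd.DataFrame(
        {col: pd.Index(vals).get_indexer(df[col])
         for col, vals in values_by_column.items()},
        index=df.index,  # keep the original row labels
    )

# Sample data assumed from the question's example.
df = pd.DataFrame({'x': [0, 0, 1], 'y': ['A', 'D', 'X'], 'z': [10, 13, 20]})
positions = positions_in(df, {'x': [0, 1], 'y': ['a', 'A', 'D', 'X']})
print(positions)
```

Columns not named in the mapping (z here) are simply left out of the result.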

3 Comments

That is clever! It is also something I have not seen in the previous S.O. posts I read.
@SumNeuron If it helps, please consider accepting the answer. Thanks.
Will do, but I have been advised in the past not to accept answers immediately, and to give others in the community some time to chime in with other ideas as well :)

Update: use a Series with .loc; you could also try reindex.

pd.Series(range(len(x_values)),index=x_values).loc[df.x].tolist()
Out[33]: [0, 0, 1]
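
The reindex variant is mentioned but not shown, so here is a rough sketch of both side by side, using sample data assumed from the question. The practical difference, as far as I know, is in how unknown values are handled: .loc raises a KeyError for a label that is not in the index, while reindex fills the gap with NaN.

```python
import pandas as pd

# Sample data assumed from the question's example.
df = pd.DataFrame({'x': [0, 0, 1]})
x_values = [0, 1]

# Map each possible value to its position in x_values.
pos = pd.Series(range(len(x_values)), index=x_values)

via_loc = pos.loc[df['x']].tolist()
via_reindex = pos.reindex(df['x']).tolist()
print(via_loc, via_reindex)

# reindex tolerates a label that is not in x_values and yields NaN for it.
with_missing = pos.reindex([0, 5])
print(with_missing)
```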

1 Comment

I think I might not have been clear about which indices I am trying to retrieve. I would like the indices for columns x and y based on the position at which their values occur in x_values and y_values, so I would expect the returned indices for x (based on this trivial example) to be [0, 0, 1, ...]
