
Given

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [np.array(['1', '2.3']), np.array(['30', '99'])]},
                  index=pd.date_range('2020-01-01', '2020-01-02', freq='D'))

I would like to filter for np.array(['1', '2.3']). I can do

df[df['x'].apply(lambda x: np.array_equal(x, np.array(['1', '2.3'])))]

but is this the fastest way to do it?

EDIT: Let's assume that all the elements inside the numpy arrays are strings, even though that's not good practice!

The DataFrame can have up to 500k rows, and each numpy array can hold up to 10 values.

  • Are all the arrays in column x of equal shape, i.e. of length 2? Commented Aug 6, 2021 at 9:14
  • Are you looking for exact matches? I would be careful with floats, as I regard something like 2.300000001 as equal to 2.3 in most domains. Maybe you can first concatenate all the arrays into a matrix, then do a subtraction, and then filter on the absolute differences? That would be the usual solution. It seems strange to me that you have a list of numpy arrays; it's better to have a single numpy array, as only then are the np operations really efficient. Commented Aug 6, 2021 at 9:16
  • @logicalx2 I have edited the question changing the floats into strings. Commented Aug 6, 2021 at 9:29
  • @ShubhamSharma not restricted to just 2 values, no. However, not greater than 10. Commented Aug 6, 2021 at 9:29
  • @user270199 What is the data size(Number of rows in column x) that you are dealing with? Commented Aug 6, 2021 at 9:53

2 Answers


You can rely on a list comprehension for performance:

df[np.array([np.array_equal(x, np.array([1, 2.3])) for x in df['x'].values])]

Performance via timeit (on my system, currently with 4 GB RAM):

%timeit -n 2000 df[np.array([np.array_equal(x, np.array([1, 2.3])) for x in df['x'].values])]
#output:
425 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)

%timeit -n 2000 df[df['x'].apply(lambda x: np.array_equal(x, np.array([1, 2.3])))]
#output:
875 µs ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)

3 Comments

We need to check for equality rather than inclusion; A in B will test for membership of elements of array A in B (see the demo after these comments).
You still have to remove in x
@ShubhamSharma updated the answer... thanks for noticing :)

My suggestion would be to do the following:

import numpy as np

mat = np.stack([np.array(["a", "b", "c"]), np.array(["d", "e", "f"])])

In reality this would be the actual data from the column of your DataFrame. Make sure it is a single 2-D numpy array.

Then do:

matching_rows = (np.array(["a", "b", "c"]) == mat).all(axis=1)

This gives you an array of booleans indicating where the matches are located, so you can then filter your rows like this:

df[matching_rows]

1 Comment

I suspect that this should be faster than a python loop over a lot of small numpy arrays. Think about it like this: The numpy arrays form contiguous sections of memory and to have optimal cache locality, you really want to have "all current data nearby". This is why you really want to avoid having lots of small numpy arrays at all costs. So a type like List[np.array] is already slightly fishy.
