
Given

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [np.array(['1', '2.3']), np.array(['30', '99'])]},
                  index=pd.date_range('2020-01-01', '2020-01-02', freq='D'))

I would like to filter for np.array(['1', '2.3']). I can do

df[df['x'].apply(lambda x: np.array_equal(x, np.array(['1', '2.3'])))]

but is this the fastest way to do it?

EDIT: Let's assume that all the elements inside the numpy arrays are strings, even though that's not good practice!

The DataFrame can have up to 500k rows, and each numpy array can hold up to 10 values.

  • Are all the arrays in column x of equal shape, i.e. of length 2? Commented Aug 6, 2021 at 9:14
  • Are you looking for exact matches? I would be careful with floats, as I regard something like 2.300000001 as equal to 2.3 in most domains. Maybe you can first concatenate all the arrays into a matrix, then do a subtraction, and then filter on the absolute differences? That would be the usual solution. It seems strange to me that you have a list of numpy arrays; it's better to have a single numpy array, as only then are the np operations really efficient. Commented Aug 6, 2021 at 9:16
  • @logicalx2 I have edited the question changing the floats into strings. Commented Aug 6, 2021 at 9:29
  • @ShubhamSharma not restricted to just 2 values, no. However, not greater than 10. Commented Aug 6, 2021 at 9:29
  • @user270199 What is the data size(Number of rows in column x) that you are dealing with? Commented Aug 6, 2021 at 9:53

2 Answers


You can rely on a list comprehension for performance:

df[np.array([np.array_equal(x, np.array([1, 2.3])) for x in df['x'].values])]

Performance via timeit (on my system, currently with 4 GB RAM):

%timeit -n 2000 df[np.array([np.array_equal(x, np.array([1, 2.3])) for x in df['x'].values])]
#output:
425 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)

%timeit -n 2000 df[df['x'].apply(lambda x: np.array_equal(x, np.array([1, 2.3])))]
#output:
875 µs ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)

3 Comments

We need to check for equality rather than inclusion; A in B will test for membership of elements of array A in B (see the demo after these comments).
You still have to remove in x
@ShubhamSharma updated the answer... thanks for noticing :)

My suggestion would be to do the following:

import numpy as np

mat = np.stack([np.array(["a", "b", "c"]), np.array(["d", "e", "f"])])

In reality this would be the actual data from the column of your DataFrame. Make sure it is a single 2-D numpy array.

Then do:

matching_rows = (np.array(["a", "b", "c"]) == mat).all(axis=1)

This gives you an array of booleans indicating where the matches are located, so you can then filter your rows like this:

df[matching_rows]

1 Comment

I suspect that this should be faster than a python loop over a lot of small numpy arrays. Think about it like this: The numpy arrays form contiguous sections of memory and to have optimal cache locality, you really want to have "all current data nearby". This is why you really want to avoid having lots of small numpy arrays at all costs. So a type like List[np.array] is already slightly fishy.
