Pandas DataFrame filter rows using another DataFrame Column

Question

import pandas as pd
import numpy as np

dict1 = {'col1': ['A', 'A', 'A', 'A', 'A','B', 'B', 'B', 'B', 'B' ], 
       'col2':[2, 2, 2, 3, 3, 2, 2, 3, 3 , 3], 
       'col3':[0.7, 0.8, 0.9, 0.95, 0.85, 0.65, 0.75, 0.45, 0.55, 0.75 ],
       'col4':[100,200,300,400,500,600,700,800,900,1000]}
df1 = pd.DataFrame(data=dict1)
df1

dict2 = {'col1': ['A', 'B' ], 
       'col2':[0.75, 0.65], 
       'col3':[1000, 2000 ],
       'col4':[0.8, 0.9]}
df2 = pd.DataFrame(data=dict2)
df2

In fastest way how to filter df1 using df2, depending on df1['col3'] >= df2['col2'] for equal col1s?

Intended outcome

>>> df1
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

My attempt gave the following error

>>> df1= df1[df1['col3'] >= float(df2[df2['col1']==df1['col1']]['col2'].values[0])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/burcak/anaconda3/lib/python3.7/site-packages/pandas/core/ops/common.py", line 64, in new_method
    return method(self, other)
  File "/home/burcak/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 521, in wrapper
    raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects

Congratulations on asking pandas Question 170,000! @Burcak You have just won a new car! ;) — David Erickson
– David Erickson, Commented Sep 18, 2020 at 1:02

sammywemmy · Accepted Answer · 2020-09-18 00:49:03Z

6

I will do merge

out = df1.merge(df2[['col1','col2']], on = 'col1', suffixes = ('','1')).query('col3>=col21').drop('col21',1)

out
Out[15]: 
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

Or reindex

out = df1[df1['col3'] >= df2.set_index('col1')['col2'].reindex(df1['col1']).values]
Out[19]: 
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

You could also use map:

 df1.loc[df1.col3 >= df1.col1.map(df2.set_index("col1").col2)]

edited Sep 18, 2020 at 0:49

sammywemmy

28.9k4 gold badges21 silver badges35 bronze badges

answered Sep 18, 2020 at 0:39

BENY

324k22 gold badges176 silver badges250 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

burcak Over a year ago

How is this col21 generated?

BENY Over a year ago

@burcak df1.merge(df2[['col1','col2']], on = 'col1', suffixes = ('','1')) look at the suffixes~ if duplicate column it will add the str at the end

burcak Over a year ago

My aim is ro reduce runtime

BENY Over a year ago

@burcak we provide multiple solution , you can check which one is better , I recommend map or reindex then

burcak Over a year ago

Yes, I time them and map solution gave the best runtime. Thanks @BEN_YO

David Erickson · Accepted Answer · 2020-09-18 01:11:01Z

3

My method would be similar to @Ben_Yo 's merge answer, but more lines of code, but perhaps a little more straightforward.

You simply:

Merge the column in and create new dataframe s
Change the datafame s into a boolean series that returns True or False according to the condition, which in this case is s['col3'] >= s['col2']
Finally, pass s to df1, and the outcome will exclude rows that returned False in the boolean series s:

s = pd.merge(df1[['col1', 'col3']], df2[['col1', 'col2']], how='left', on='col1')
s = s['col3'] >= s['col2']
df1[s]
Out[1]: 
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

edited Sep 18, 2020 at 1:11

answered Sep 18, 2020 at 0:51

David Erickson

16.7k2 gold badges21 silver badges37 bronze badges

Collectives™ on Stack Overflow

Pandas DataFrame filter rows using another DataFrame Column

2 Answers 2

5 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related