8
import pandas as pd
import numpy as np

dict1 = {'col1': ['A', 'A', 'A', 'A', 'A','B', 'B', 'B', 'B', 'B' ], 
       'col2':[2, 2, 2, 3, 3, 2, 2, 3, 3 , 3], 
       'col3':[0.7, 0.8, 0.9, 0.95, 0.85, 0.65, 0.75, 0.45, 0.55, 0.75 ],
       'col4':[100,200,300,400,500,600,700,800,900,1000]}
df1 = pd.DataFrame(data=dict1)
df1

dict2 = {'col1': ['A', 'B' ], 
       'col2':[0.75, 0.65], 
       'col3':[1000, 2000 ],
       'col4':[0.8, 0.9]}
df2 = pd.DataFrame(data=dict2)
df2

In fastest way how to filter df1 using df2, depending on df1['col3'] >= df2['col2'] for equal col1s?

Intended outcome

>>> df1
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

My attempt gave the following error

>>> df1= df1[df1['col3'] >= float(df2[df2['col1']==df1['col1']]['col2'].values[0])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/burcak/anaconda3/lib/python3.7/site-packages/pandas/core/ops/common.py", line 64, in new_method
    return method(self, other)
  File "/home/burcak/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 521, in wrapper
    raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects
1
  • 1
    Congratulations on asking pandas Question 170,000! @Burcak You have just won a new car! ;) Commented Sep 18, 2020 at 1:02

2 Answers 2

6

I will do merge

out = df1.merge(df2[['col1','col2']], on = 'col1', suffixes = ('','1')).query('col3>=col21').drop('col21',1)

out
Out[15]: 
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

Or reindex

out = df1[df1['col3'] >= df2.set_index('col1')['col2'].reindex(df1['col1']).values]
Out[19]: 
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

You could also use map:

 df1.loc[df1.col3 >= df1.col1.map(df2.set_index("col1").col2)]
Sign up to request clarification or add additional context in comments.

5 Comments

How is this col21 generated?
@burcak df1.merge(df2[['col1','col2']], on = 'col1', suffixes = ('','1')) look at the suffixes~ if duplicate column it will add the str at the end
My aim is ro reduce runtime
@burcak we provide multiple solution , you can check which one is better , I recommend map or reindex then
Yes, I time them and map solution gave the best runtime. Thanks @BEN_YO
3

My method would be similar to @Ben_Yo 's merge answer, but more lines of code, but perhaps a little more straightforward.

You simply:

  1. Merge the column in and create new dataframe s
  2. Change the datafame s into a boolean series that returns True or False according to the condition, which in this case is s['col3'] >= s['col2']
  3. Finally, pass s to df1, and the outcome will exclude rows that returned False in the boolean series s:

s = pd.merge(df1[['col1', 'col3']], df2[['col1', 'col2']], how='left', on='col1')
s = s['col3'] >= s['col2']
df1[s]
Out[1]: 
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.