Python Pandas remove rows containing values from a list

Question

I am using Pandas to compare two large CSVs that both contain contact information. I want to remove any rows from one CSV whose email address appears in the other CSV.

So if I had

DF1

name phone email
1    1     [email protected]
2    2     [email protected]
3    3     [email protected]

DF2

name phone email
x    y     [email protected]
a    b     [email protected]

I would be left with

DF3

name phone email
1    1     [email protected]

I don't care about any columns except the email addresses. This seems like it would be easy, but I'm really struggling with this one.

Here is what I have, but I don't think this is even close:

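import pandas as pd

def remove_warm_list_duplicates(dataframe):
    '''Remove rows that have emails from the warmlist'''
    # on_bad_lines="skip" replaces error_bad_lines=False (removed in pandas 2.0)
    warm_list = pd.read_csv(r'warmlist/' + 'warmlist.csv'
                            , encoding="ISO-8859-1"
                            , on_bad_lines="skip")
    warm_list_emails = warm_list['Email Address'].tolist()
    # Boolean indexing builds a new frame, so the result has to be returned
    dataframe = dataframe[~dataframe['Email Address'].isin(warm_list_emails)]
    return dataframe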

Vaishali · Answer · 2017-03-03 00:47:50Z

9

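You can use pandas isin() and negate the resulting boolean mask with ~, so that only rows whose email is not in df2 are kept: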

df3 = df1[~df1['email'].isin(df2['email'])]

Resulting df

    name    phone   email
0   1       1       [email protected]
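
For anyone who wants to run this end to end, here is a minimal sketch. The frames mirror the shape of DF1 and DF2 from the question, but the email addresses are invented placeholders, since the real ones are not shown.

```python
import pandas as pd

# Placeholder data; the addresses are made up for illustration only.
df1 = pd.DataFrame({
    "name": [1, 2, 3],
    "phone": [1, 2, 3],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})
df2 = pd.DataFrame({
    "name": ["x", "a"],
    "phone": ["y", "b"],
    "email": ["b@example.com", "c@example.com"],
})

# isin() returns a boolean Series that is True where df1's email occurs in df2.
mask = df1["email"].isin(df2["email"])

# ~ inverts the mask, so only rows whose email is absent from df2 survive.
df3 = df1[~mask]
print(df3)
```

Because this is plain boolean indexing, df3 keeps the original index labels of df1 and leaves df1 itself unchanged.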

MaxU - stand with Ukraine · Accepted Answer · 2017-03-02 23:44:05Z

1

Try a left merge with indicator=True, then keep only the rows that were found in df1 alone:

In [143]: pd.merge(df1, df2[['email']], on='email', how='left', indicator=True) \
            .query("_merge == 'left_only'") \
            .drop(columns='_merge')
Out[143]:
   name  phone      email
0     1      1  [email protected]
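
The same anti-join can be written as a standalone script. This is only a sketch under the same assumptions as the question's example: the column is named email, and the addresses below are invented placeholders.

```python
import pandas as pd

# Placeholder data; the addresses are made up for illustration only.
df1 = pd.DataFrame({
    "name": [1, 2, 3],
    "phone": [1, 2, 3],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})
df2 = pd.DataFrame({
    "name": ["x", "a"],
    "phone": ["y", "b"],
    "email": ["b@example.com", "c@example.com"],
})

# Merge against only the email column of df2 so no extra columns are pulled in.
# indicator=True adds a "_merge" column with values left_only / right_only / both.
merged = df1.merge(df2[["email"]], on="email", how="left", indicator=True)

# Rows marked left_only had no matching email in df2; drop the helper column.
df3 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(df3)
```

One difference from the isin() approach: a merge on columns produces a fresh integer index rather than preserving the index of df1.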

1 Comment

jacobherrington Over a year ago

This is the answer I went with, but I think several approaches work.

pml · Answer · 2017-03-02 23:49:25Z

1

You could simplify a bit with unique() and sets:

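warm_list = pd.read_csv(r'warmlist/' + 'warmlist.csv',
                        encoding="ISO-8859-1",
                        on_bad_lines="skip")  # replaces error_bad_lines=False, removed in pandas 2.0

warm_list_emails = set(warm_list['Email Address'].unique())
# ~ negates the mask so rows whose email is in the warm list are dropped
df = df.loc[~df['Email Address'].isin(warm_list_emails), :]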
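
If you want this as a reusable function like the one in the question, one possible shape is sketched below. Passing the warm list in as a DataFrame (the warm_list parameter) and the column argument are my own assumptions, chosen so the function can be tried without the CSV file. Note that filtering creates a new frame rather than changing the one passed in, so the function has to return its result.

```python
import pandas as pd

def remove_warm_list_duplicates(dataframe, warm_list, column="Email Address"):
    """Return the rows of dataframe whose email does not appear in warm_list.

    warm_list is a hypothetical parameter: an already-loaded DataFrame
    that has the same email column as dataframe.
    """
    # A set of the distinct, non-missing warm-list emails.
    warm_list_emails = set(warm_list[column].dropna().unique())
    # ~ inverts the membership mask so matching rows are removed.
    return dataframe.loc[~dataframe[column].isin(warm_list_emails)]

# Tiny usage example with invented addresses.
contacts = pd.DataFrame({"Email Address": ["a@example.com", "b@example.com"]})
warm = pd.DataFrame({"Email Address": ["b@example.com", "b@example.com"]})

filtered = remove_warm_list_duplicates(contacts, warm)
print(filtered)
```

The caller needs to use the return value (for example, contacts = remove_warm_list_duplicates(contacts, warm)); the original frame is left untouched.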
