Pandas populate new dataframe column based on matching columns in another dataframe

Question

I have a df which contains my main data which has one million rows. My main data also has 30 columns. Now I want to add another column to my df called category. The category is a column in df2 which contains around 700 rows and two other columns that will match with two columns in df.

I begin with setting an index in df2 and df that will match between the frames, however some of the index in df2 doesn't exist in df.

The remaining columns in df2 are called AUTHOR_NAME and CATEGORY.

The relevant column in df is called AUTHOR_NAME.

Some of the AUTHOR_NAME in df doesn't exist in df2 and vice versa.

The instruction I want is: when index in df matches with index in df2 and title in df matches with title in df2, add category to df, else add NaN in category.

Example data:

df2
           AUTHOR_NAME              CATEGORY
Index       
Pub1        author1                 main
Pub2        author1                 main
Pub3        author1                 main
Pub1        author2                 sub
Pub3        author2                 sub
Pub2        author4                 sub


df
            AUTHOR_NAME     ...n amount of other columns        
Index       
Pub1        author1                 
Pub2        author1     
Pub1        author2 
Pub1        author3
Pub2        author4 

expected_result
            AUTHOR_NAME             CATEGORY   ...n amount of other columns
Index
Pub1        author1                 main
Pub2        author1                 main
Pub1        author2                 sub
Pub1        author3                 NaN
Pub2        author4                 sub

If I use df2.merge(df,left_index=True,right_index=True,how='left', on=['AUTHOR_NAME']) my df becomes three times bigger than it is supposed to be.

So I thought maybe merging was the wrong way to go about this. What I am really trying to do is use df2 as a lookup table and then return type values to df depending on if certain conditions are met.

def calculate_category(df2, d):
    category_row = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
    return str(category_row['CATEGORY'].iat[0])

df.apply(lambda d: calculate_category(df2, d), axis=1)

However, this throws me an error:

IndexError: ('index out of bounds', u'occurred at index 7614')

I'm not sure if on and left_index/right_index work together. Maybe you need on=['Index', 'AUTHOR_NAME'] (or something similar). And I not sure which dataframe is left in df2.merge(df,...). Maybe you need how="right" or pd.merge(left=df, right=df2, ...) — furas
– furas, Commented Oct 2, 2016 at 12:25

LW001 · Accepted Answer · 2017-09-26 19:20:30Z

48

Consider the following dataframes df and df2

df = pd.DataFrame(dict(
        AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
        title=      list('zyxwvutsrqponml')
    ))

df2 = pd.DataFrame(dict(
        AUTHOR_NAME=list('AABCCEGG'),
        title      =list('zwvtrpml'),
        CATEGORY   =list('11223344')
    ))

option 1
merge

df.merge(df2, how='left')

option 2
join

cols = ['AUTHOR_NAME', 'title']
df.join(df2.set_index(cols), on=cols)

both options yield

edited Sep 26, 2017 at 19:20

LW001

3,0117 gold badges35 silver badges44 bronze badges

answered Oct 2, 2016 at 14:50

piRSquared

296k68 gold badges509 silver badges654 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

A.Shoman Over a year ago

Just a quick note .. that worked for me fine .. but in some cases I ended up having duplicates .. still not sure what is the reason for that .. but anyhow, drop the duplicates keeping only one (keep = first OR keep = last) .. check the documentation here pandas.pydata.org/pandas-docs/stable/generated/…

Nickil Maveli · Accepted Answer · 2016-10-02 14:40:02Z

2

APPROACH 1:

You could use concat instead and drop the duplicated values present in both Index and AUTHOR_NAME columns combined. After that, use isin for checking membership:

df_concat = pd.concat([df2, df]).reset_index().drop_duplicates(['Index', 'AUTHOR_NAME'])
df_concat.set_index('Index', inplace=True)
df_concat[df_concat.index.isin(df.index)]

Note: The column Index is assumed to be set as the index column for both the DF's.

APPROACH 2:

Use join after setting the index column correctly as shown:

df2.set_index(['Index', 'AUTHOR_NAME'], inplace=True)
df.set_index(['Index', 'AUTHOR_NAME'], inplace=True)

df.join(df2).reset_index()

edited Oct 2, 2016 at 14:40

answered Oct 2, 2016 at 12:27

Nickil Maveli

29.8k10 gold badges86 silver badges88 bronze badges

2 Comments

Nickil Maveli Over a year ago

Well, it works for me though. Did you use the appropriate names for the DFs and also Is your Index column in both frames set as the index?Please double check. Let me edit the post with the names set by you.

user3471881 Over a year ago

Yes, the Index column in both frames is set as the index and I used the appropriate names for DFs

Bhagabat Behera · Accepted Answer · 2018-06-24 12:26:04Z

1

You may try the following. It will merge both the datasets on specified column as key.

expected_result = pd.merge(df, df2, on = 'CATEGORY', how = 'left')

edited Jun 24, 2018 at 12:26

answered Jun 24, 2018 at 11:48

Bhagabat Behera

8717 silver badges7 bronze badges

1 Comment

WhoIsJack Over a year ago

As pointed out in comments on previous answers of yours, please provide further context with your answers and format them properly. Otherwise, they will have to be flagged for removal due to low quality. Thank you for helping us keep StackOverflow answers helpful and of high quality!

kiltannen · Accepted Answer · 2018-05-25 10:15:49Z

-1

While the other answers here give very good and elegant solutions to the asked question, I have found a resource that both answers this question in an extremely elegant fashion, as well as giving a beautifully clear and straightforward set of examples on how to accomplish join/ merge of dataframes, effectively teaching LEFT, RIGHT, INNER and OUTER joins.

Join And Merge Pandas Dataframe

I honestly feel any further seekers after this topic will want to also examine his examples...

answered May 25, 2018 at 10:15

kiltannen

1,1572 gold badges14 silver badges30 bronze badges

1 Comment

Teepeemm Over a year ago

But we usually prefer that answers live here, not behind a link to elsewhere. The only thing I really get from your post is that the solution lies in using join and/or merge.

Collectives™ on Stack Overflow

Pandas populate new dataframe column based on matching columns in another dataframe

4 Answers 4

1 Comment

2 Comments

1 Comment

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

2 Comments

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related