I have a df which contains my main data which has one million rows. My main data also has 30 columns. Now I want to add another column to my df called category. The category is a column in df2 which contains around 700 rows and two other columns that will match with two columns in df.
I begin with setting an index in df2 and df that will match between the frames, however some of the index in df2 doesn't exist in df.
The remaining columns in df2 are called AUTHOR_NAME and CATEGORY.
The relevant column in df is called AUTHOR_NAME.
Some of the AUTHOR_NAME in df doesn't exist in df2 and vice versa.
The instruction I want is: when index in df matches with index in df2 and title in df matches with title in df2, add category to df, else add NaN in category.
Example data:
df2
AUTHOR_NAME CATEGORY
Index
Pub1 author1 main
Pub2 author1 main
Pub3 author1 main
Pub1 author2 sub
Pub3 author2 sub
Pub2 author4 sub
df
AUTHOR_NAME ...n amount of other columns
Index
Pub1 author1
Pub2 author1
Pub1 author2
Pub1 author3
Pub2 author4
expected_result
AUTHOR_NAME CATEGORY ...n amount of other columns
Index
Pub1 author1 main
Pub2 author1 main
Pub1 author2 sub
Pub1 author3 NaN
Pub2 author4 sub
If I use df2.merge(df,left_index=True,right_index=True,how='left', on=['AUTHOR_NAME']) my df becomes three times bigger than it is supposed to be.
So I thought maybe merging was the wrong way to go about this. What I am really trying to do is use df2 as a lookup table and then return type values to df depending on if certain conditions are met.
def calculate_category(df2, d):
category_row = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
return str(category_row['CATEGORY'].iat[0])
df.apply(lambda d: calculate_category(df2, d), axis=1)
However, this throws me an error:
IndexError: ('index out of bounds', u'occurred at index 7614')



onandleft_index/right_indexwork together. Maybe you needon=['Index', 'AUTHOR_NAME'](or something similar). And I not sure which dataframe is left indf2.merge(df,...). Maybe you needhow="right"orpd.merge(left=df, right=df2, ...)