Python and Pandas rookie here! I'm trying to transpose a dataframe that contains a million records using a for loop. As you can imagine, it's painfully slow. Please see below for my process and code.
There are two dataframes I'm working with: transactions - which contains the customer ID (column a) and the category each customer purchased from.
import pandas
transactions = pandas.DataFrame({'a': ['johnny', 'sally', 'maggy', 'lassy', 'johnny', 'sally', 'maggy'],
                                 'category': ['fruits', 'fruits', 'spices', 'veggies', 'veggies', 'spices', 'snacks']})
category_list - which contains all categories a customer could purchase from.
category_list=pandas.DataFrame({'category':['fruits','spices','veggies','snacks','drinks','alcohol','adult']})
For each customer, if the customer has ever made a purchase in a given category, assign a value of 1. If not, assign a value of 0.
Code:
cust_list = transactions['a'].unique()
final_data = pandas.DataFrame()
for i in cust_list:
    # Distinct categories purchased by this customer, flagged with 1
    step1 = transactions[transactions.a == i]
    step1 = step1.drop_duplicates()
    step1['value'] = 1
    # Right merge keeps every category; unpurchased ones become NaN, then 0
    cat_merge = pandas.merge(step1, category_list, how='right', left_on='category', right_on='category')
    cat_merge['a'] = i
    cat_merge = cat_merge.fillna(0)
    # Transpose so that categories become columns and only the 'value' row remains
    cat_merge_transpose = pandas.DataFrame(cat_merge.transpose())
    cat_merge_transpose = cat_merge_transpose.drop(cat_merge_transpose.index[0])
    cat_merge_transpose.columns = cat_merge_transpose.iloc[0]
    cat_merge_transpose = cat_merge_transpose.drop(cat_merge_transpose.index[0])
    cat_merge_transpose.reset_index()  # no effect: the result is not assigned
    cat_merge_transpose.insert(0, 'a', i)
    # DataFrame.append was removed in pandas 2.0; pandas.concat does the same job here
    final_data = pandas.concat([final_data, cat_merge_transpose], ignore_index=True)
So in this case the result would look like this:
print(final_data)
Any help I can get to optimize this and make it run significantly faster, with fewer lines of code, would be very much appreciated.
Thank you.
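
For reference, here is a minimal sketch of one vectorized way to get the same 0/1 table without the per-customer loop. It assumes `pandas.crosstab` is acceptable for this data and reuses the sample dataframes above; the names `counts`, `flags` and the `pd` alias are only illustrative. It is one possible approach, not necessarily the fastest one for a million rows.

```python
import pandas as pd

# Sample data from the question
transactions = pd.DataFrame({
    'a': ['johnny', 'sally', 'maggy', 'lassy', 'johnny', 'sally', 'maggy'],
    'category': ['fruits', 'fruits', 'spices', 'veggies', 'veggies', 'spices', 'snacks'],
})
category_list = pd.DataFrame({
    'category': ['fruits', 'spices', 'veggies', 'snacks', 'drinks', 'alcohol', 'adult'],
})

# Count purchases for every (customer, category) pair in a single pass
counts = pd.crosstab(transactions['a'], transactions['category'])

# Collapse the counts to a 0/1 flag: 1 if the customer ever bought from the category
flags = (counts > 0).astype(int)

# Add the categories nobody purchased as all-zero columns, in category_list order,
# then turn the customer index back into a regular column
final_data = (
    flags.reindex(columns=category_list['category'].tolist(), fill_value=0)
         .reset_index()
)
final_data.columns.name = None  # drop the leftover axis label, if any

print(final_data)
```

Two differences from the loop version: the rows come out sorted by customer name rather than in order of first appearance, and the flag columns are integers rather than objects.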