Python and Pandas rookie here! I'm trying to transpose a dataframe that contains a million records using a for loop. As you can imagine, it's painfully slow. Please see below for my process and code.

There are two dataframes I'm working with: transactions, which contains the customer ID (column a) and the category they purchased from.

import pandas

transactions = pandas.DataFrame({'a': ['johnny', 'sally', 'maggy', 'lassy', 'johnny', 'sally', 'maggy'],
                                 'category': ['fruits', 'fruits', 'spices', 'veggies', 'veggies', 'spices', 'snacks']})

category_list - which contains all categories a customer could purchase from.

category_list=pandas.DataFrame({'category':['fruits','spices','veggies','snacks','drinks','alcohol','adult']})

For each customer, if the customer has (ever) made a purchase in a given category, then assign a value 1. If not, then assign value of 0.

Code:

cust_list = transactions['a'].unique()
final_data = pandas.DataFrame()

for i in cust_list:
    step1 = transactions[transactions.a == i]
    step1 = step1.drop_duplicates()
    step1['value'] = 1
    cat_merge = pandas.merge(step1, category_list, how='right', left_on='category', right_on='category')
    cat_merge['a'] = i
    cat_merge = cat_merge.fillna(0)
    cat_merge_transpose = pandas.DataFrame(cat_merge.transpose())
    cat_merge_transpose = cat_merge_transpose.drop(cat_merge_transpose.index[0])
    cat_merge_transpose.columns = cat_merge_transpose.iloc[0]
    cat_merge_transpose = cat_merge_transpose.drop(cat_merge_transpose.index[0])
    cat_merge_transpose.reset_index()
    cat_merge_transpose.insert(0, 'a', i)
    # DataFrame.append was removed in pandas 2.0; pandas.concat is the equivalent call.
    final_data = pandas.concat([final_data, cat_merge_transpose], ignore_index=True)

So in this case the result would look like this:

print(final_data)

Any help I can get to optimize this and make it run significantly faster, with fewer lines of code, would be very much appreciated.

Thank you.

2 Answers

3

Your problem can be viewed as a pivot operation, and here we could use pivot_table (df below is your transactions frame):

>>> df["value"] = 1
>>> P = df.pivot_table(index="a", columns="category", values="value", aggfunc=max)
>>> P.loc[:,category_list.category.unique()].fillna(0)
category  fruits  spices  veggies  snacks  drinks  alcohol  adult
a                                                                
johnny         1       0        1       0       0        0      0
lassy          0       0        1       0       0        0      0
maggy          0       1        0       1       0        0      0
sally          1       1        0       0       0        0      0

The pivot_table itself gives us

>>> P
category  fruits  snacks  spices  veggies
a                                        
johnny         1     NaN     NaN        1
lassy        NaN     NaN     NaN        1
maggy        NaN       1       1      NaN
sally          1     NaN       1      NaN

and then we index into this using all the category columns (including the ones which weren't seen), calling fillna to replace the NaNs with 0.
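
On recent pandas versions, selecting column labels that are missing from the frame with .loc raises a KeyError, so the same pivot-then-fill idea may need reindex instead. The following is a minimal sketch under that assumption, using the question's transactions and category_list frames; the names flags, pivoted and result are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

transactions = pd.DataFrame({
    "a": ["johnny", "sally", "maggy", "lassy", "johnny", "sally", "maggy"],
    "category": ["fruits", "fruits", "spices", "veggies", "veggies", "spices", "snacks"],
})
category_list = pd.DataFrame({
    "category": ["fruits", "spices", "veggies", "snacks", "drinks", "alcohol", "adult"],
})

# Mark every observed (customer, category) pair with 1 (hypothetical helper frame).
flags = transactions.assign(value=1)

# One row per customer, one column per category that actually occurs in the data.
pivoted = flags.pivot_table(index="a", columns="category", values="value", aggfunc="max")

# reindex adds the never-purchased categories as all-NaN columns and sets the column
# order; fillna(0) turns the gaps into zeros and astype(int) gives integer 0/1 flags.
result = (
    pivoted.reindex(columns=category_list["category"].unique())
    .fillna(0)
    .astype(int)
)
print(result)
```

Passing aggfunc as the string "max" rather than the builtin is a choice made for this sketch; the rest follows the same steps as the answer above.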


3 Comments

Wow! This is crazy. Even with significantly less code!
I want to give both the tick, but I'm going to give the tick to the one that runs fastest. You guys are awesome.
Yours took less than a minute for almost 2 million records! I give the nod to DSM for the performance! The other solution took 10 minutes, which is still 100,000 times faster than my for loop! Stack rocks!
1
# Get a unique list of all category items.
categories = category_list.category.unique().tolist()

# For transactions with a given customer matching any category, assign a value of one.
transactions['value'] = transactions.groupby('a').category.transform(
                            lambda s: s.isin(categories).any()).astype(int)
output = transactions.groupby(['a', 'category']).max().unstack().fillna(0)
output.columns = output.columns.droplevel()
zero_cols = [c for c in categories if c not in output]
for col in zero_cols:
    output[col] = 0
>>> output
category  fruits  snacks  spices  veggies  drinks  alcohol  adult
a                                                                
johnny         1       0       0        1       0        0      0
lassy          0       0       0        1       0        0      0
maggy          0       1       1        0       0        0      0
sally          1       0       1        0       0        0      0
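
The same groupby/unstack idea can probably be written without the helper column and without the loop over missing columns. This is a minimal sketch of that variation, assuming the question's transactions and category_list frames; it is not the answer's original code, and the name counts is introduced here for illustration.

```python
import pandas as pd

transactions = pd.DataFrame({
    "a": ["johnny", "sally", "maggy", "lassy", "johnny", "sally", "maggy"],
    "category": ["fruits", "fruits", "spices", "veggies", "veggies", "spices", "snacks"],
})
category_list = pd.DataFrame({
    "category": ["fruits", "spices", "veggies", "snacks", "drinks", "alcohol", "adult"],
})

categories = category_list["category"].unique().tolist()

# Count purchases per (customer, category); pairs that never occur become 0.
counts = transactions.groupby(["a", "category"]).size().unstack(fill_value=0)

# Cap the counts at 1 to get a 0/1 flag, then add unseen categories as zero columns.
output = counts.clip(upper=1).reindex(columns=categories, fill_value=0)
print(output)
```

Note that the columns here follow the order of category_list, which differs from the column order printed above.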

3 Comments

Thanks. This however doesn't get me the transposed matrix.
Thanks Alexander. I used your solution against my sample data and the results are not the same as those from the code I've posted. My final output should be the same as what I have in final_data.
Sorry about that. The values indicate whether you've ordered from that category. Thanks.
