For Loop alternative Pandas Python

Question

Python and Pandas rookie here! I'm trying to transpose a dataframe that contains a million records using a for loop. As you can imagine, it's painstakingly slow. Please see below for my process and code.

There are two dataframes I'm working with. The first is transactions, which contains the customer ID (column a) and the category they purchased from.

import pandas

transactions = pandas.DataFrame({'a': ['johnny', 'sally', 'maggy', 'lassy', 'johnny', 'sally', 'maggy'],
                                 'category': ['fruits', 'fruits', 'spices', 'veggies', 'veggies', 'spices', 'snacks']})

category_list - which contains all categories a customer could purchase from.

category_list=pandas.DataFrame({'category':['fruits','spices','veggies','snacks','drinks','alcohol','adult']})

For each customer, if the customer has ever made a purchase in a given category, assign a value of 1. If not, assign a value of 0.

Code:

cust_list = transactions['a'].unique()
final_data = pandas.DataFrame()

for i in cust_list:
    step1 = transactions[transactions.a == i]
    step1 = step1.drop_duplicates()
    step1['value'] = 1
    cat_merge = pandas.merge(step1, category_list, how='right', left_on='category', right_on='category')
    cat_merge['a'] = i
    cat_merge = cat_merge.fillna(0)
    cat_merge_transpose = pandas.DataFrame(cat_merge.transpose())
    cat_merge_transpose = cat_merge_transpose.drop(cat_merge_transpose.index[0])
    cat_merge_transpose.columns = cat_merge_transpose.iloc[0]
    cat_merge_transpose = cat_merge_transpose.drop(cat_merge_transpose.index[0])
    cat_merge_transpose.reset_index()
    cat_merge_transpose.insert(0, 'a', i)
    # DataFrame.append was removed in pandas 2.0; pandas.concat does the same job here
    final_data = pandas.concat([final_data, cat_merge_transpose], ignore_index=True)

So in this case the result would look like this:

print(final_data)

Any help I can get to optimize this and make it run significantly faster, with fewer lines of code, will be very much appreciated.

Thank you.

DSM · Accepted Answer · 2016-02-01 03:01:16Z

3

Your problem can be viewed as a pivot operation, and here we could use pivot_table:

>>> transactions["value"] = 1
>>> P = transactions.pivot_table(index="a", columns="category", values="value", aggfunc=max)
>>> P.loc[:,category_list.category.unique()].fillna(0)
category  fruits  spices  veggies  snacks  drinks  alcohol  adult
a                                                                
johnny         1       0        1       0       0        0      0
lassy          0       0        1       0       0        0      0
maggy          0       1        0       1       0        0      0
sally          1       1        0       0       0        0      0

The pivot_table itself gives us

>>> P
category  fruits  snacks  spices  veggies
a                                        
johnny         1     NaN     NaN        1
lassy        NaN     NaN     NaN        1
maggy        NaN       1       1      NaN
sally          1     NaN       1      NaN

and then we index into this using all the category columns (including the ones which weren't seen), calling fillna to replace the NaNs with 0.
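
For readers on a recent pandas release, here is a minimal self-contained sketch of the same pivot idea. It assumes the question's transactions and category_list frames; the names flags and result are illustrative and not from the answer. It swaps in reindex for the .loc lookup, because pandas 1.0 and later raise a KeyError when .loc is given a list containing labels that are missing from the axis.

```python
import pandas as pd

# Sample data copied from the question.
transactions = pd.DataFrame({
    "a": ["johnny", "sally", "maggy", "lassy", "johnny", "sally", "maggy"],
    "category": ["fruits", "fruits", "spices", "veggies", "veggies", "spices", "snacks"],
})
category_list = pd.DataFrame({
    "category": ["fruits", "spices", "veggies", "snacks", "drinks", "alcohol", "adult"],
})

# Mark every transaction with 1, then pivot customers against categories.
flags = transactions.assign(value=1)  # 'flags' is an illustrative name
P = flags.pivot_table(index="a", columns="category", values="value", aggfunc="max")

# reindex adds the categories nobody bought as all-NaN columns,
# fillna turns every NaN into 0, and astype(int) drops the float dtype.
result = (
    P.reindex(columns=category_list["category"].unique())
     .fillna(0)
     .astype(int)
)
print(result)
```

The column order follows category_list, so unseen categories such as drinks still appear, filled with zeros.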

answered Feb 1, 2016 at 3:01

3 Comments

BlackHat Over a year ago

Wow! This is crazy. Even with significantly less code!

BlackHat Over a year ago

I want to give you both the tick, but I'm going to give it to the one that runs fastest. You guys are awesome.

BlackHat Over a year ago

Yours took less than a minute for almost 2 million records! I give the nod to DSM for the performance! Other solution took 10 minutes, which is still 100,000 times faster than my for loop! Stack rocks!

Alexander · Answer · 2016-02-01 02:21:41Z

1

# Get a unique list of all category items.
categories = category_list.category.unique().tolist()

# For transactions with a given customer matching any category, assign a value of one.
transactions['value'] = transactions.groupby('a').category.transform(
                            lambda s: s.isin(categories).any()).astype(int)
output = transactions.groupby(['a', 'category']).max().unstack().fillna(0)
output.columns = output.columns.droplevel()
zero_cols = [c for c in categories if c not in output]
for col in zero_cols:
    output[col] = 0
>>> output
category  fruits  snacks  spices  veggies  drinks  alcohol  adult
a                                                                
johnny         1       0       0        1       0        0      0
lassy          0       0       0        1       0        0      0
maggy          0       1       1        0       0        0      0
sally          1       0       1        0       0        0      0
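
The groupby/unstack idea above can also be written as a single chain. This is a minimal sketch that assumes the question's sample data; reindex with fill_value=0 is swapped in for the explicit loop over zero_cols, so the columns come out in category_list order rather than in the order printed above.

```python
import pandas as pd

# Sample data copied from the question.
transactions = pd.DataFrame({
    "a": ["johnny", "sally", "maggy", "lassy", "johnny", "sally", "maggy"],
    "category": ["fruits", "fruits", "spices", "veggies", "veggies", "spices", "snacks"],
})
categories = ["fruits", "spices", "veggies", "snacks", "drinks", "alcohol", "adult"]

output = (
    transactions.assign(value=1)
    .groupby(["a", "category"])["value"]
    .max()                                      # 1 for every (customer, category) pair seen
    .unstack(fill_value=0)                      # categories become columns, unseen pairs get 0
    .reindex(columns=categories, fill_value=0)  # add categories nobody bought
)
print(output)
```

Because the fill values are supplied at the unstack and reindex steps, no NaN is ever introduced and the result keeps an integer dtype.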

edited Feb 1, 2016 at 2:21

answered Feb 1, 2016 at 1:59

3 Comments

BlackHat Over a year ago

Thanks. This however doesn't get me the transposed matrix.

BlackHat Over a year ago

Thanks Alexander. I used your solution against my sample data and the results are not the same as those from the code I've posted. My final output should be the same as what I have in final_data.

BlackHat Over a year ago

Sorry about that. The values are if you've ordered from that category. Thanks.
