2

I have a dataframe of users who purchased various items. I want to breakout that list of values into separate columns and have a binary flag for users who purchased that item.

Input:

       A           B
0  James  [123, 456]
1   Mary       [123]
2   John  [456, 789]

Expected Output:

       A           B  123  456  789
0  James  [123, 456]    1    1    0
1   Mary       [123]    1    0    0
2   John  [456, 789]    0    1    1

What I've tried (step by step)

df['B'].explode() is my first step:

0    123
0    456
1    123
2    456
2    789

The using get_dummies() pd.get_dummies(df['B'].explode()):

   123  456  789
0    1    0    0
0    0    1    0
1    1    0    0
2    0    1    0
2    0    0    1

Join it together on index df.join(pd.get_dummies(df['B'].explode())):

       A           B  123  456  789
0  James  [123, 456]    1    0    0
0  James  [123, 456]    0    1    0
1   Mary       [123]    1    0    0
2   John  [456, 789]    0    1    0
2   John  [456, 789]    0    0    1

Problem:

Now I just need to groupby and combine. However, with millions and millions of rows and customers buying 100s of products, this method of joining/combining is highly inefficient. Is there a more "pandas-friendly" or built in function that does this?

1
  • for performance you can also try using this solution using MultiLabelBinarizer Commented Apr 7, 2020 at 14:19

1 Answer 1

7

you can replace pd.get_dummies(df['B'].explode() with pd.get_dummies(df.B.explode()).sum(level=0) and join.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.