Python - Attempting to create binary features from a column with lists of strings

Question

It was hard for me to come up with a clear title, but an example should make things clearer.

Index C1
1     [dinner]
2     [brunch, food]
3     [dinner, fancy]

Now I'd like to create one binary feature for each unique value that appears in the lists in this column.

The example above would turn into:

Index C1               dinner  brunch  fancy  food
1     [dinner]         1       0       0      0
2     [brunch, food]   0       1       0      1
3     [dinner, fancy]  1       0       1      0

Any help would be much appreciated.

Possible duplicate of Pandas convert a column of list to dummies
– Lev Zakharov, Commented Aug 13, 2018 at 0:50
Look up creating dummy variables in python. Plenty of material out there on this already. stackoverflow.com/questions/11587782/…
– Eric, Commented Aug 13, 2018 at 0:51
Possible duplicate of Creating dummy variables in pandas for python
– Eric, Commented Aug 13, 2018 at 0:51

cs95 · Accepted Answer · 2018-08-13 00:55:28Z

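For a performant solution, I recommend building a new DataFrame from the column's lists with tolist() and passing it to pd.get_dummies.

# dtype=int keeps the 0/1 output shown below; pandas 2.0+ returns booleans by default
pd.get_dummies(pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='', dtype=int)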

   brunch  dinner  fancy  food
0       0       1      0     0
1       1       0      0     1
2       0       1      1     0

This should be much faster than expanding the lists with apply(pd.Series).
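
As a minimal, self-contained sketch of the approach above: the frame below is rebuilt from the question's example (treating Index as the DataFrame's index is an assumption), and dtype=int is passed so the result shows 0/1 on recent pandas versions. Because tolist() discards the original index, the sketch passes it back in when building the expanded frame.

```python
import pandas as pd

# Example data from the question; using "Index" as the index is an assumption.
df = pd.DataFrame(
    {"C1": [["dinner"], ["brunch", "food"], ["dinner", "fancy"]]},
    index=pd.Index([1, 2, 3], name="Index"),
)

# One column per list position; shorter lists are padded with missing values.
expanded = pd.DataFrame(df["C1"].tolist(), index=df.index)

# Empty prefix and separator so the new columns are named after the labels only.
dummies = pd.get_dummies(expanded, prefix="", prefix_sep="", dtype=int)
print(dummies)
```

The result can be joined back onto the original frame with df.join(dummies) if the list column should be kept alongside the indicators.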

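This works as long as each label ends up in only one output column, that is, no list repeats a value (e.g., ['dinner', ..., 'dinner']) and no value appears at different positions in different lists. Otherwise get_dummies emits several columns with the same name, and you'll need an extra groupby step to merge them:

(pd.get_dummies(
    pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='', dtype=int)
   .T.groupby(level=0)   # group same-named columns; groupby(axis=1) is deprecated in pandas 2.1+
   .sum()
   .T)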

Note that if a list does repeat a value, the summed result holds counts, so what you get isn't strictly "binary" anymore.
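
If strictly 0/1 indicators are wanted even when labels repeat or shift position between lists, one option (a sketch, not part of the original answer) is to merge same-named columns with max instead of sum. The sample data here is hypothetical and chosen so that 'dinner' appears both twice in one list and at different positions across lists.

```python
import pandas as pd

# Hypothetical data: 'dinner' repeats in row 0 and sits at position 1 in row 1.
df = pd.DataFrame({"C1": [["dinner", "dinner"], ["food", "dinner"], ["brunch"]]})

dummies = pd.get_dummies(
    pd.DataFrame(df["C1"].tolist()), prefix="", prefix_sep="", dtype=int
)
# At this point there are two columns named 'dinner', one per list position.

# Transpose, group rows that share a label, take the max, transpose back.
binary = dummies.T.groupby(level=0).max().T
print(binary)
```

Using max keeps every cell at 0 or 1, whereas sum would report 2 for the first row's 'dinner'.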

BENY · Accepted Answer · 2018-08-13 01:41:05Z

Another option is scikit-learn's MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Assumes Index is an ordinary column of df; mlb.classes_ holds the sorted unique labels
pd.DataFrame(mlb.fit_transform(df.C1), columns=mlb.classes_, index=df.Index).reset_index()

   Index  brunch  dinner  fancy  food
0      1       0       1      0     0
1      2       1       0      0     1
2      3       0       1      1     0
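
A small self-contained sketch of the same idea, on hypothetical data that includes a repeated label: MultiLabelBinarizer treats each list as a set of labels, so repeats and ordering within a list do not change the 0/1 result. Reusing the frame's own index for the output is an assumption about how the data is stored.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical data: 'dinner' repeats in the first list.
df = pd.DataFrame(
    {"C1": [["dinner", "dinner"], ["food", "dinner"], ["brunch"]]},
    index=pd.Index([1, 2, 3], name="Index"),
)

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df["C1"]),  # dense 0/1 array, one column per label
    columns=mlb.classes_,         # sorted unique labels
    index=df.index,
)
print(encoded)
```

A fitted binarizer can also be reused with mlb.transform on new rows, which keeps the column layout consistent between training and later data.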
