2

It was hard for me to come up with clear title but an example should make things more clear.

Index C1
1     [dinner]
2     [brunch, food]
3     [dinner, fancy]

Now, I'd like to create a set of binary features for each of the unique values in this column.

The example above would turn into:

Index C1               dinner  brunch  fancy food
1     [dinner]         1       0       0     0
2     [brunch, food]   0       1       0     1
3     [dinner, fancy]  1       0       1     0

Any help would be much appreciated.

3

2 Answers 2

2

For a performant solution, I recommend creating a new DataFrame by listifying your column.

pd.get_dummies(pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')

   brunch  dinner  fancy  food
0       0       1      0     0
1       1       0      0     1
2       0       1      1     0

This is going to be so much faster than apply(pd.Series).

This works assuming lists don't have more of the same value (eg., ['dinner', ..., 'dinner']). If they do, then you'll need an extra groupby step:

(pd.get_dummies(
    pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
   .groupby(level=0, axis=1)
   .sum())

Well, if your data is like this, then what you're looking for isn't "binary" anymore.

Sign up to request clarification or add additional context in comments.

Comments

2

Maybe using MultiLabelBinarizer

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.C1),columns=mlb.classes_,index=df.Index).reset_index()
Out[970]: 
   Index  brunch  dinner  fancy  food
0      1       0       1      0     0
1      2       1       0      0     1
2      3       0       1      1     0

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.