Binary-vectorize pandas DataFrame column

Question

In a fictional patients dataset one might encounter the following table:

import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"]
})

Which renders the following dataset:

  Patients             Disease
0     Luke             Cooties
1    Nigel          Dragon Pox
2    Sarah  Greycale & Cooties

Now, assuming that rows with multiple illnesses all use the same pattern (a separator character, here &) and that a complete list of the illnesses, diseases, exists, I have yet to find a simple way to apply the pandas.get_dummies one-hot encoder in this situation to obtain a binary vector for each patient.

How can I obtain, in the simplest possible manner, the following binary vectorization from the initial DataFrame?

pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Cooties": [1, 0, 1],
    "Dragon Pox": [0, 1, 0],
    "Greycale": [0, 0, 1]
})

Vaishali · Accepted Answer · 2019-05-12 14:00:46Z

6

You can use Series.str.get_dummies with the right separator:

df.set_index('Patients')['Disease'].str.get_dummies(' & ').reset_index()

    Patients    Cooties Dragon Pox  Greycale
0   Luke        1       0           0
1   Nigel       0       1           0
2   Sarah       1       0           1
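As a self-contained sketch of the one-liner above, the snippet below rebuilds the question's DataFrame and passes the separator by keyword. The names encoded and unsplit are introduced here only for illustration. The second call shows what happens with the default separator "|": the string is not split, so "Greycale & Cooties" becomes a single label.

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})

# sep is the delimiter each string is split on before one-hot encoding
encoded = (
    df.set_index("Patients")["Disease"]
      .str.get_dummies(sep=" & ")
      .reset_index()
)

# With the default sep="|" nothing is split, so the combined
# string is treated as one label of its own
unsplit = df["Disease"].str.get_dummies()

print(encoded)
print(unsplit.columns.tolist())
```

The separator therefore has to match the delimiter actually used in the data, including the surrounding spaces.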

edited May 12, 2019 at 14:00

answered May 12, 2019 at 13:55


2 Comments

Erfan Over a year ago

Great answer, didn't know about str.get_dummies. +1 for the better answer.

akalanka Over a year ago

What is the importance of ' & ' arg in get_dummies? Is this arg related to this question context only?

Erfan · Accepted Answer · 2019-05-12 13:37:03Z

2

We can unnest your strings into rows using the explode_str function (defined at the end of this answer).

After that we use pivot_table with aggfunc=len:

df = explode_str(df, 'Disease', ' & ')

print(df)
  Patients     Disease
0     Luke     Cooties
1    Nigel  Dragon Pox
2    Sarah    Greycale
2    Sarah     Cooties

df.pivot_table(index='Patients', columns='Disease', aggfunc=len)\
  .fillna(0).reset_index()

Disease Patients  Cooties  Dragon Pox  Greycale
0           Luke      1.0         0.0       0.0
1          Nigel      0.0         1.0       0.0
2          Sarah      1.0         0.0       1.0

Function used from linked answer:

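import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # repeat each row position once per token in its string
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # join all strings, re-split into tokens, and assign one token per repeated row
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})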
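For comparison, here is a sketch of the same unnest-then-count idea using only built-in pandas methods. It assumes pandas 0.25 or later, where DataFrame.explode is available, and it swaps pivot_table for groupby(...).size().unstack(), so it is a variation on this answer rather than the same code. The names exploded and table are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})

# Turn each string into a list of diseases, then give every list element its own row
exploded = df.assign(Disease=df["Disease"].str.split(" & ")).explode("Disease")

# Count (patient, disease) pairs and spread the diseases into columns
table = (
    exploded.groupby(["Patients", "Disease"])
            .size()
            .unstack(fill_value=0)
            .reset_index()
)
table.columns.name = None  # drop the leftover "Disease" label on the column axis

print(table)
```

Because the counts come from size(), the result holds integers rather than the floats produced by fillna(0).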

answered May 12, 2019 at 13:37


2 Comments

Quang Hoang Over a year ago

get_dummies is a better choice than pivot_table here.

Erfan Over a year ago

get_dummies won't give the expected output the OP is asking for. You will get 4 rows, at least to my knowledge, but feel free to post your own answer.

KenHBS · Accepted Answer · 2019-05-12 13:48:30Z

Option 1

You could check for the occurrence of each disease in df['Disease'] in a loop:

>>> diseases = ['Cooties', 'Dragon Pox', 'Greycale']
>>> for disease in diseases:
...     # split on ' & ' so that rows listing several diseases match each of them
...     df[disease] = [int(disease in val.split(' & ')) for val in df['Disease']]
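When the complete list of diseases is known up front, as the question assumes, it can also be used to guarantee a column for every disease, including ones that no patient in the data has. The sketch below combines Series.str.get_dummies with reindex to do that. "Spattergroit" is a hypothetical extra entry added only to show the effect.

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})

# Complete list of known diseases; "Spattergroit" is a made-up entry
# that does not occur in the data
diseases = ["Cooties", "Dragon Pox", "Greycale", "Spattergroit"]

dummies = (
    df["Disease"].str.get_dummies(sep=" & ")
      .reindex(columns=diseases, fill_value=0)  # add absent diseases as all-zero columns
)
result = pd.concat([df[["Patients"]], dummies], axis=1)

print(result)
```

This keeps the column layout stable across datasets, which matters when the binary vectors feed a model that expects a fixed set of features.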

Option 2

Alternatively, you could use .get_dummies after splitting the strings in df['Disease'] on ' & ' (including the leading space, so that no label keeps a trailing space).

>>> sub_df = df['Disease'].str.split(' & ', expand=True)
>>> dummies = pd.get_dummies(sub_df)
>>> dummies

#    0_Cooties  0_Dragon Pox  0_Greycale   1_Cooties
# 0          1             0            0          0
# 1          0             1            0          0
# 2          0             0            1          1

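# Let's rename the columns by taking only the text after the '_'
>>> _, dummies.columns = zip(*dummies.columns.str.split('_'))
# groupby(..., axis=1) is deprecated since pandas 2.1, so transpose,
# group the duplicate labels, and transpose back
>>> dummies.T.groupby(level=0).sum().T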

#      Cooties  Dragon Pox   Greycale 
#   0        1           0          0
#   1        0           1          0
#   2        1           0          1
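A more compact variant of the same split-then-encode idea avoids the prefixed column names altogether. This is a sketch that assumes pandas 0.25 or later for Series.explode, and the names tokens, dummies and result are illustrative. Each disease gets its own row that keeps the original row index, and the dummy rows are then collapsed back to one row per patient.

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})

# One row per disease; exploded rows share the index of the row they came from
tokens = df["Disease"].str.split(" & ").explode()

# One-hot encode every token, then merge rows with the same original index
dummies = pd.get_dummies(tokens, dtype=int).groupby(level=0).max()

result = df[["Patients"]].join(dummies)
print(result)
```

Using max rather than sum keeps the values binary even if a disease is listed twice for the same patient.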

akalanka · Accepted Answer · 2024-03-25 18:06:41Z

I was looking for an answer to a similar problem but in a slightly different context.

My dataframe is like this.

project    lib
p1          l1
p1          l2
p2          l3
p3          l2

My intended output is like this:

project    l1   l2   l3
p1          1    1    0
p2          0    0    1
p3          0    1    0

If I use get_dummies, my output is like this:

  project  l1  l2  l3
0      p1   1   0   0
1      p1   0   1   0
2      p3   0   0   1
3      p2   0   1   0

For me, pivot_table gave the intended output.

Here is a minimal example:

import pandas as pd

df_test = pd.DataFrame({
    'project': ['p1', 'p1', 'p3', 'p2'],
    'lib': ['l1', 'l2', 'l3', 'l2']
})
df_dummy = df_test.pivot_table(index="project", columns="lib", aggfunc=lambda x: 1, fill_value=0)
print(df_dummy)
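For this long-format case (one row per project and lib pair), pd.crosstab is another option that does not depend on how pivot_table behaves when there is no values column. The sketch below reuses the df_test data from the example above. The name table is illustrative, and clip(upper=1) is added here as an assumption that duplicate pairs should still count as 1.

```python
import pandas as pd

df_test = pd.DataFrame({
    'project': ['p1', 'p1', 'p3', 'p2'],
    'lib': ['l1', 'l2', 'l3', 'l2']
})

# Count occurrences of each (project, lib) pair, then cap the counts at 1
# so that a repeated pair still yields a binary indicator
table = pd.crosstab(df_test['project'], df_test['lib']).clip(upper=1)

print(table)
```

The result has one row per project and one column per lib, with the projects as the index. Call reset_index() on it if project should be a regular column.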
