5

In a fictional patients dataset one might encounter the following table:

pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"]
})

Which renders the following dataset:

Fictional diseases

Now, assuming that the rows with multiple illnesses use the same pattern (separation with a character, in this context a &) and that there exists a complete list diseases of the illnesses, I've yet to find a simple solution to applying to these situations pandas.get_dummies one-hot encoder to obtain a binary vector for each patient.

How can I obtain, in the simplest possible manner, the following binary vectorization from the initial DataFrame?

pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Cooties":[1, 0, 1],
    "Dragon Pox":[0, 1, 0],
    "Greyscale":[0, 0, 1]
})

Desired result

4 Answers 4

6

You can use Series.str.get_dummies with right separator,

df.set_index('Patients')['Disease'].str.get_dummies(' & ').reset_index()

    Patients    Cooties Dragon Pox  Greycale
0   Luke        1       0           0
1   Nigel       0       1           0
2   Sarah       1       0           1
Sign up to request clarification or add additional context in comments.

2 Comments

Great answer, didnt know about str.get_dummies +1 for better answer
What is the importance of ' & ' arg in get_dummies? Is this arg related to this question context only?
2

We can unnest your string to rows using this function.

After that we use pivot_table with aggfunc=len:

df = explode_str(df, 'Disease', ' & ')

print(df)
  Patients     Disease
0     Luke     Cooties
1    Nigel  Dragon Pox
2    Sarah    Greycale
2    Sarah     Cooties

df.pivot_table(index='Patients', columns='Disease', aggfunc=len)\
  .fillna(0).reset_index()

Disease Patients  Cooties  Dragon Pox  Greycale
0           Luke      1.0         0.0       0.0
1          Nigel      0.0         1.0       0.0
2          Sarah      1.0         0.0       1.0

Function used from linked answer:

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

2 Comments

get_dummies is a better choice than pivot_table here.
get_dummies wont give expected output as OP is asking. You will get 4 rows, at least to my knowledge, but feel free to post your own answer
0

Option 1

You could check the occurrence of disease in df['Disease'] in a loop:

>>> diseases = ['Cooties', 'Dragon Pox', 'Greycale']
>>> for disease in diseases:
>>>     df[disease] = pd.Series(val == disease for val in df['Disease'].values).astype(int)

Option 2

Alternatively, you could use .get_dummies, after you split the strings in df['Disease'] by '& '.

>>> sub_df = df['Disease'].str.split('& ', expand=True)
>>> dummies = pd.get_dummies(sub_df)
>>> dummies

#    0_Cooties  0_Dragon Pox  0_Greycale   1_Cooties
# 0          1             0            0          0
# 1          0             1            0          0
# 2          0             0            1          1

# Let's rename the columns by taking only the text after the '_'
>>> _, dummies.columns = zip(*dummies.columns.str.split('_'))
>>> dummies.groupby(dummies.columns, axis=1).sum()

#      Cooties  Dragon Pox   Greycale 
#   0        1           0          0
#   1        0           1          0
#   2        1           0          1

Comments

0

I was looking for an answer to a similar problem but in slightly a different context.

My dataframe is like this.

project    lib
p1          l1
p1          l2
p2          l3
p3          l2

My intended output is like this:

project    l1   l2   l3
p1          1    1    0
p2          0    0    1
p3          0    1    0

If I use get_dummies, my output is like this:

  project  l1  l2  l3
0      p1   1   0   0
1      p1   0   1   0
2      p3   0   0   1
3      p2   0   1   0

For me, pivot_table gave the intended output.

Here is a minimal example:

df_test = pd.DataFrame({
        'project': ['p1', 'p1', 'p3', 'p2'],
        'lib': ['l1', 'l2', 'l3', 'l2']
    })
df_dummy = df_test.pivot_table(index="project", columns="lib", aggfunc=lambda x: 1, fill_value=0)
print(df_dummy)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.