Let's say my DataFrame df is created like this:

import pandas as pd

df = pd.DataFrame({"title" : ["Robin Hood", "Madagaskar"],
                   "genres" : ["Action, Adventure", "Family, Animation, Comedy"]},
                  columns=["title", "genres"])

and it looks like this:

        title                     genres
0  Robin Hood          Action, Adventure
1  Madagaskar  Family, Animation, Comedy

Let's assume each movie can have any number of genres. How can I expand the DataFrame into

        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy

?

3 Answers

In [33]: (df.set_index('title')
            ['genres'].str.split(r',\s*', expand=True)
            .stack()
            .reset_index(name='genre')
            .drop(columns='level_1'))
Out[33]:
        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy

PS: here you can find a more generic approach.


You can use np.repeat with numpy.concatenate for flattening.

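# split each genres string into a list, then count the items per row
splitted = df['genres'].str.split(r',\s*')
l = splitted.str.len()

# repeat each title once per genre and flatten the genre lists
df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
                    'genres': np.concatenate(splitted.values)}, columns=['title', 'genres'])
print(df1)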
        title      genres
0  Robin Hood      Action
1  Robin Hood   Adventure
2  Madagaskar      Family
3  Madagaskar   Animation
4  Madagaskar      Comedy
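
To make the mechanism explicit, here is a minimal sketch of the two NumPy calls on small hand-written inputs. The names titles, genre_lists and counts are illustrative only and do not appear in the answer above. np.repeat repeats each title as many times as its row has genres, and np.concatenate flattens the per-row lists in the same order, so the two results line up element by element.

```python
import numpy as np

# Illustrative inputs (hypothetical names), mirroring the two rows of df
titles = np.array(["Robin Hood", "Madagaskar"])
genre_lists = [["Action", "Adventure"], ["Family", "Animation", "Comedy"]]

# Number of genres per row; plays the role of `l` above
counts = [len(g) for g in genre_lists]

# Repeat each title once per genre, and flatten the lists of genres
repeated_titles = np.repeat(titles, counts)
flat_genres = np.concatenate(genre_lists)

print(repeated_titles.tolist())
print(flat_genres.tolist())
```

Both arrays have one entry per (title, genre) pair, which is why they can be passed as columns of the same DataFrame.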

Timings:

df = pd.concat([df]*100000).reset_index(drop=True)

In [95]: %%timeit
    ...: splitted = df['genres'].str.split(',\s*')
    ...: l = splitted.str.len()
    ...: 
    ...: df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
    ...:                      'genres':np.concatenate(splitted.values)}, columns=['title','genres'])
    ...: 
    ...: 
1 loop, best of 3: 709 ms per loop

In [96]: %timeit (df.set_index('title')['genres'].str.split(r',\s*', expand=True).stack().reset_index(name='genre').drop(columns='level_1'))
1 loop, best of 3: 750 ms per loop


Since pandas 0.25.0 there is a native method for this, called explode.

It unnests each element of a list into its own row and repeats the values of the other columns (and of the index).

So we first have to call Series.str.split to turn each string into a list of elements:

>>> df.assign(genres=df['genres'].str.split(', ')).explode('genres')

        title     genres
0  Robin Hood     Action
0  Robin Hood  Adventure
1  Madagaskar     Family
1  Madagaskar  Animation
1  Madagaskar     Comedy
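
Note that the output above keeps the original index (0, 0, 1, 1, 1) and the column name genres. If the exact frame from the question is wanted, a possible variant (a sketch, assuming pandas 0.25.0 or later; the name result is illustrative) splits on a regex so that spacing after the comma does not matter, stores the lists in a new genre column, and resets the index afterwards:

```python
import pandas as pd

df = pd.DataFrame({"title": ["Robin Hood", "Madagaskar"],
                   "genres": ["Action, Adventure", "Family, Animation, Comedy"]})

result = (df.assign(genre=df["genres"].str.split(r",\s*"))  # lists of genres
            .drop(columns="genres")                         # drop the original strings
            .explode("genre")                               # one row per list element
            .reset_index(drop=True))                        # renumber rows 0..n-1

print(result)
```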
