How to split/expand a string value into several pandas DataFrame rows?

Question

Let's say my DataFrame df is created like this:

import pandas as pd

df = pd.DataFrame({"title" : ["Robin Hood", "Madagaskar"],
                  "genres" : ["Action, Adventure", "Family, Animation, Comedy"]},
                 columns=["title", "genres"])

and it looks like this:

        title                     genres
0  Robin Hood          Action, Adventure
1  Madagaskar  Family, Animation, Comedy

Let's assume each movie can have any number of genres. How can I expand the DataFrame into

        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy

?

MaxU - stand with Ukraine · Accepted Answer · 2019-02-13 09:43:32Z

8

In [33]: (df.set_index('title')
            ['genres'].str.split(r',\s*', expand=True)
            .stack()
            .reset_index(name='genre')
            .drop(columns='level_1'))
Out[33]:
        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy

PS: here you can find a more generic approach.

edited Feb 13, 2019 at 9:43

answered Nov 30, 2017 at 11:02

jezrael · Accepted Answer · 2017-11-30 12:50:09Z

You can use np.repeat with numpy.concatenate for flattening.

import numpy as np

splitted = df['genres'].str.split(r',\s*')
l = splitted.str.len()

df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
                     'genres':np.concatenate(splitted.values)}, columns=['title','genres'])
print(df1)
        title      genres
0  Robin Hood      Action
1  Robin Hood   Adventure
2  Madagaskar      Family
3  Madagaskar   Animation
4  Madagaskar      Comedy

Timings:

df = pd.concat([df]*100000).reset_index(drop=True)

In [95]: %%timeit
    ...: splitted = df['genres'].str.split(r',\s*')
    ...: l = splitted.str.len()
    ...: 
    ...: df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
    ...:                      'genres':np.concatenate(splitted.values)}, columns=['title','genres'])
    ...: 
    ...: 
1 loop, best of 3: 709 ms per loop

In [96]: %timeit (df.set_index('title')['genres'].str.split(r',\s*', expand=True).stack().reset_index(name='genre').drop(columns='level_1'))
1 loop, best of 3: 750 ms per loop
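
The numbers above depend on the machine and the pandas version, so it may be worth re-running the comparison yourself. A rough sketch using the standard timeit module could look like the following. The helper names with_repeat and with_stack, the repeat count of 1000 and number=5 are choices made for this sketch, not part of the original answer, and the extra dropna() is only a precaution.

```python
import timeit

import numpy as np
import pandas as pd

# Sample frame from the question, repeated to get a measurable workload.
# The repeat count (1000) is an arbitrary choice for this sketch.
base = pd.DataFrame({"title": ["Robin Hood", "Madagaskar"],
                     "genres": ["Action, Adventure", "Family, Animation, Comedy"]})
big = pd.concat([base] * 1000).reset_index(drop=True)


def with_repeat(df):
    # Split once, then repeat each title by the number of genres it has.
    parts = df["genres"].str.split(r",\s*")
    counts = parts.str.len()
    return pd.DataFrame({"title": np.repeat(df["title"].to_numpy(), counts),
                         "genre": np.concatenate(parts.to_numpy())})


def with_stack(df):
    # Split into columns, stack the columns into rows, drop the helper level.
    # dropna() is a precaution: titles with fewer genres get missing values
    # after the split, and whether stack() drops them may depend on the
    # pandas version.
    return (df.set_index("title")["genres"]
              .str.split(r",\s*", expand=True)
              .stack()
              .dropna()
              .reset_index(name="genre")
              .drop(columns="level_1"))


for func in (with_repeat, with_stack):
    seconds = timeit.timeit(lambda: func(big), number=5)
    print(f"{func.__name__}: {seconds / 5:.4f} s per call")
```

Both helpers should return the same rows in the same order, so only the timing differs.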

Erfan · Accepted Answer · 2019-12-27 16:29:10Z

1

Since pandas 0.25.0 there is a native method for this, called explode.

This method unnests each element in a list to a new row and repeats the other columns.

So first we have to call Series.str.split on the string column to turn each string into a list of elements.

>>> df.assign(genres=df['genres'].str.split(', ')).explode('genres')

        title     genres
0  Robin Hood     Action
0  Robin Hood  Adventure
1  Madagaskar     Family
1  Madagaskar  Animation
1  Madagaskar     Comedy
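
Note that explode keeps the original row labels, which is why the index above reads 0, 0, 1, 1, 1. If you want the 0..4 index and the genre column name from the question, a minimal sketch (assuming the same df, and splitting on a comma followed by optional whitespace) might be:

```python
import pandas as pd

df = pd.DataFrame({"title": ["Robin Hood", "Madagaskar"],
                   "genres": ["Action, Adventure", "Family, Animation, Comedy"]})

result = (
    df.assign(genre=df["genres"].str.split(r",\s*"))  # list of genres per row
      .drop(columns="genres")                         # keep only title and genre
      .explode("genre")                               # one row per list element
      .reset_index(drop=True)                         # replace repeated labels with 0..n-1
)
print(result)
```

The reset_index(drop=True) call is the only part that changes the index; everything before it is the same split-then-explode idea as above.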

edited Dec 27, 2019 at 16:29

answered Dec 27, 2019 at 12:38
