396

I have a Pandas DataFrame with one column:

import pandas as pd

df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})

       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

How can I split this column of lists into two columns?

Desired result:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

13 Answers

556

You can use the DataFrame constructor with lists created by to_list:

import pandas as pd

d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
print(df2)
       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

df2[['team1','team2']] = pd.DataFrame(df2.teams.tolist(), index=df2.index)
print(df2)
       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2  [SF, NYG]    SF   NYG
3  [SF, NYG]    SF   NYG
4  [SF, NYG]    SF   NYG
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG

And for a new DataFrame:

df3 = pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
print(df3)
  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

A solution with apply(pd.Series) is very slow:

#7k rows
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [121]: %timeit df2['teams'].apply(pd.Series)
1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [122]: %timeit pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

7 Comments

The list method will reset the index. If you want to keep the index, copy it after constructing the new DataFrame with df3.index = df2.index.
what about : df1.apply(lambda x: x["teams"], result_type="expand",axis=1)
@RoyAssis - Can you test performance? I think it is slow
Seems like the "tolist" method only works when all lists are the same length.
126

Much simpler solution:

pd.DataFrame(df2["teams"].to_list(), columns=['team1', 'team2'])

Yields:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

If you wanted to split a column of delimited strings rather than lists, you could similarly do:

pd.DataFrame(df["teams"].str.split('<delim>', expand=True).values,
             columns=['team1', 'team2'])
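
For a concrete sketch of the string case, assume (purely for illustration) that the strings are dash-delimited like "SF-NYG"; substitute whatever delimiter your data actually uses:

import pandas as pd

# Hypothetical data: one column of dash-delimited strings.
df = pd.DataFrame({"teams": ["SF-NYG", "SF-NYG", "SF-NYG"]})

# split(expand=True) returns a DataFrame; wrap it to set the column names.
out = pd.DataFrame(df["teams"].str.split("-", expand=True).values,
                   columns=["team1", "team2"])
print(out)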

3 Comments

What if the lists have an uneven number of elements?
If you wanted to split a column of delimited strings rather than lists, you could similarly do: df["teams"].str.split('<delim>', expand=True) already returns a DataFrame, so it would probably be simpler to just rename the columns.
Thanks @AMC, that did it for me, but since my "list" of UUIDs was actually a str pretending to be a list, I first had to apply a lambda to remove the square brackets.
80

This solution preserves the index of the df2 DataFrame, unlike any solution that uses tolist():

df3 = df2.teams.apply(pd.Series)
df3.columns = ['team1', 'team2']

Here's the result:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

1 Comment

.apply(pd.Series) is easy to remember and type. Unfortunately, as stated in other answers, it is also very slow for large numbers of observations. If the index to be preserved is easily accessible, preservation using the DataFrame constructor approach is as simple as passing the index argument to the constructor, as seen in other answers. In the middle of a method chain, one workaround is to store an intermediate Series or DataFrame using an assignment expression (Python 3.8+) and then access the index from there.
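
A minimal sketch of that walrus-style workaround (Python 3.8+), using the named df2 from above purely to illustrate the mechanism; in a real method chain the intermediate frame would have no name of its own:

# Capture the intermediate Series with an assignment expression and reuse its index.
df3 = pd.DataFrame(
    (teams := df2["teams"]).to_list(),
    columns=["team1", "team2"],
    index=teams.index,
)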
24

There is a syntactically simpler way, and therefore one that is easier to remember, compared to the proposed solutions. I'm assuming that the column is called 'meta' in a DataFrame df (note that this approach expects 'meta' to contain whitespace-delimited strings rather than lists):

df2 = pd.DataFrame(df['meta'].str.split().values.tolist())

1 Comment

This answer was helpful, and could be paired with this visual demonstration of the technique: youtube.com/watch?v=vPKwm1XZjp8
19

I would like to recommend a more efficient and Pythonic way.

First, define the DataFrame as in the original post:

df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})

My solution:

%%timeit
df['team1'], df['team2'] = zip(*list(df['teams'].values))
>> 761 µs ± 8.35 µs per loop

In comparison, the most upvoted solution:

%%timeit
df[['team1','team2']] = pd.DataFrame(df.teams.tolist(), index=df.index)
df = pd.DataFrame(df['teams'].to_list(), columns=['team1','team2'])
>> 1.31 ms ± 11.2 µs per loop

My solution saves about 40% of the time and is much shorter. The only thing you need to remember is how to unpack and transpose a two-dimensional list using zip(*list).
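
If the zip(*...) idiom is unfamiliar, here is a tiny standalone illustration of the transpose it performs (plain Python, no pandas needed):

# zip(*rows) turns a list of rows into one tuple per column.
rows = [["SF", "NYG"], ["SF", "NYG"], ["SF", "NYG"]]
col1, col2 = zip(*rows)
print(col1)  # ('SF', 'SF', 'SF')
print(col2)  # ('NYG', 'NYG', 'NYG')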

1 Comment

This df['team1'], df['team2'] = zip(*list(df['teams'].values)) should be the accepted answer. Simple, readable, idiomatic.
13

List comprehension

A simple implementation with a list comprehension (my favorite):

df = pd.DataFrame([pd.Series(x) for x in df.teams])
df.columns = ['team_{}'.format(x+1) for x in df.columns]

Timing:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.71 ms

Output:

  team_1 team_2
0     SF    NYG
1     SF    NYG
2     SF    NYG
3     SF    NYG
4     SF    NYG
5     SF    NYG
6     SF    NYG

Comments

11

The previous solutions didn't work for me because I have NaN observations in my DataFrame. In my case df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index=df2.index) yields:

object of type 'float' has no len()

I solved this using a list comprehension. Here is a reproducible example:

import pandas as pd
import numpy as np
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2.loc[2,'teams'] = np.nan
df2.loc[4,'teams'] = np.nan
df2

Output:

        teams
0   [SF, NYG]
1   [SF, NYG]
2   NaN
3   [SF, NYG]
4   NaN
5   [SF, NYG]
6   [SF, NYG]

df2['team1']=np.nan
df2['team2']=np.nan

Solving with a list comprehension,

for i in [0, 1]:
    df2['team{}'.format(i+1)] = [k[i] if isinstance(k, list) else k for k in df2['teams']]

df2

yields:

       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2        NaN   NaN   NaN
3  [SF, NYG]    SF   NYG
4        NaN   NaN   NaN
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG
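
A possibly shorter alternative for the NaN case (not part of the original answer, so treat it as a sketch): the .str accessor also indexes into list values element-wise and leaves NaN rows as NaN:

# .str[i] works on list values as well as strings; NaN rows stay NaN.
df2["team1"] = df2["teams"].str[0]
df2["team2"] = df2["teams"].str[1]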

Comments

7

Here's another solution using Series.transform with operator.itemgetter:

>>> from operator import itemgetter
>>> df['teams'].transform({'team1': itemgetter(0), 'team2': itemgetter(1)})

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

Which of course can be generalized as:

>>> indices = range(len(df['teams'][0]))

>>> df['teams'].transform({f'team{i+1}': itemgetter(i) for i in indices})

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

This approach has the added benefit of letting you extract only the desired indices:

>>> df
                 teams
0  [SF, NYG, XYZ, ABC]
1  [SF, NYG, XYZ, ABC]
2  [SF, NYG, XYZ, ABC]
3  [SF, NYG, XYZ, ABC]
4  [SF, NYG, XYZ, ABC]
5  [SF, NYG, XYZ, ABC]
6  [SF, NYG, XYZ, ABC]

>>> indices = [0, 2]
>>> df['teams'].transform({f'team{i+1}': itemgetter(i) for i in indices})

  team1 team3
0    SF   XYZ
1    SF   XYZ
2    SF   XYZ
3    SF   XYZ
4    SF   XYZ
5    SF   XYZ
6    SF   XYZ

Comments

5

Based on the previous answers, here is another solution which returns the same result as df2.teams.apply(pd.Series) with a much faster run time:

pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

Timings:

In [1]:
import pandas as pd
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [2]: %timeit df2['teams'].apply(pd.Series)

8.27 s ± 2.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

35.4 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comments

2

If someone comes here to find a ready-made function, I wrote one.

  • if columns are not specified, it finds all columns containing lists and unfolds them;
  • the added columns are named column_name_0, column_name_1, etc.;
  • the column order is preserved in the final DataFrame;
  • if strict=True, it checks that the lists in a given column are all the same length.

Improvements and comments are appreciated.

def unfold_columns(df, columns=[], strict=False):
    assert isinstance(columns, list), "Columns should be a list of column names"
    if len(columns) == 0:
        # No columns given: detect every column whose values are all lists.
        all_lists = df.applymap(lambda x: isinstance(x, list)).all()
        columns = [column for column in df.columns if all_lists[column]]
    else:
        assert all(column in df.columns for column in columns), \
            "Not all given columns are found in df"
    columns_order = df.columns
    for column_name in columns:
        if df[column_name].apply(lambda x: isinstance(x, list)).all():
            if strict:
                # Optionally require that every list in this column has the same length.
                assert len(set(df[column_name].apply(lambda x: len(x)))) == 1, \
                    f"Lists in df['{column_name}'] are not of equal length"
            # Expand the list column into its own DataFrame; keep the original
            # index so the join below lines up even for non-default indexes.
            unfolded = pd.DataFrame(df[column_name].tolist(), index=df.index)
            unfolded.columns = [f'{column_name}_{x}' for x in unfolded.columns]
            # Splice the new column names into the original column order.
            columns_order = [
                *columns_order[:list(columns_order).index(column_name)],
                *unfolded.columns,
                *columns_order[list(columns_order).index(column_name)+1:]
            ]
            df = df.join(unfolded).drop([column_name], axis=1)
    return df[columns_order]
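
A quick usage sketch for the function above, assuming the seven-row example DataFrame from earlier answers:

df2 = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
out = unfold_columns(df2, columns=["teams"], strict=True)
print(out.columns.tolist())  # ['teams_0', 'teams_1']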

Comments

2

To append two new columns to the existing DataFrame:

df[['team1', 'team2']] = df["teams"].to_list()

Comments

1

Summarizing all the answers: if you just need to create a new DataFrame with two columns:

pd.DataFrame(df['teams'].tolist(), columns=['team1', 'team2'], index=df.index)

If you want to assign to the same df, you have several options.

  1. The shortest

    df[['team1', 'team2']] = df['teams'].tolist()

  2. The slowest (not recommended; it can be 10x slower or more, with no advantages)

    df[['team1', 'team2']] = df['teams'].apply(pd.Series)

  3. And, for some reason, the fastest (almost 2x faster than the first).

    df['team1'], df['team2'] = zip(*df['teams'].tolist())

So I would recommend option 1. If you really need speed, you can try option 3, but it looks strange and the advantage may disappear in future pandas versions.
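
If you want to check the relative speeds on your own machine and pandas version (the exact numbers will vary), here is a minimal timing sketch for options 1 and 3:

import timeit
import pandas as pd

# Build a larger frame from the 7-row example so the differences are measurable.
df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
df = pd.concat([df] * 1000, ignore_index=True)

def option1():
    d = df.copy()
    d[["team1", "team2"]] = d["teams"].tolist()

def option3():
    d = df.copy()
    d["team1"], d["team2"] = zip(*d["teams"].tolist())

print("option 1:", timeit.timeit(option1, number=100))
print("option 3:", timeit.timeit(option3, number=100))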

Comments

0

You can apply a lambda twice to create the new 'team1' and 'team2' columns in your df:

df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
df["team1"]=df['teams'].apply(lambda x: x[0]  )
df["team2"]=df['teams'].apply(lambda x: x[1]  )
df

(The output is the original teams column with the new team1 and team2 columns alongside.)

1 Comment

On DataFrames that come from a groupby aggregation, this is the only method that worked for me.
