Pandas DataFrame - Splitting Series Strings into Multiple Columns

Question

My question is about the methodology and syntax described in a previous post, which covers different approaches to the same objective: splitting string values into lists and assigning each list item to a new column. Here's the post: Pandas DataFrame, how do i split a column into two

df:

                          GDP
Date                        
Mar 31, 2017  19.03 trillion
Dec 31, 2016  18.87 trillion

script 1 + output:

>>> df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1).str
>>> print(df)

                GDP     Units
Date                         
Mar 31, 2017  19.03  trillion
Dec 31, 2016  18.87  trillion

script 2 + output:

>>> df[['GDP', 'Units']] = df['GDP'].str.split(' ', 1, expand=True)
>>> print(df)

                GDP     Units
Date                         
Mar 31, 2017  19.03  trillion
Dec 31, 2016  18.87  trillion

script 3 + output:

>>> df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1, expand=True)
>>> print(df)

              GDP  Units
Date                    
Mar 31, 2017    0      1
Dec 31, 2016    0      1

Can anyone explain what is going on? Why does script 3 produce these values in the output?

piRSquared · Accepted Answer · 2017-07-01 23:44:35Z

Let's start by looking at this

df['GDP'].str.split(' ', 1)

Date
Mar 31, 2017    [19.03, trillion]
Dec 31, 2016    [18.87, trillion]
Name: GDP, dtype: object

It produces a Series of lists. However, the string accessor, pd.Series.str, lets us access the first, second, ... elements of these embedded lists via ordinary Python list indexing.

df['GDP'].str.split(' ', 1).str[0]

Date
Mar 31, 2017    19.03
Dec 31, 2016    18.87
Name: GDP, dtype: object

Or

df['GDP'].str.split(' ', 1).str[1]

Date
Mar 31, 2017    trillion
Dec 31, 2016    trillion
Name: GDP, dtype: object
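
As a self-contained sketch of the indexing shown above, the snippet below rebuilds the question's frame from the values in the post and pulls each list element out with .str[i]. It passes n=1 by keyword rather than positionally; as far as I know, newer pandas releases require that, and older ones accept it too. The names parts, values, and units are my own, not from the original post.

```python
import pandas as pd

# Rebuild the question's frame (values copied from the post).
df = pd.DataFrame(
    {"GDP": ["19.03 trillion", "18.87 trillion"]},
    index=pd.Index(["Mar 31, 2017", "Dec 31, 2016"], name="Date"),
)

# Split once on the first space; each row becomes a two-element list.
parts = df["GDP"].str.split(" ", n=1)

# .str[i] picks the i-th element of each row's list.
values = parts.str[0]
units = parts.str[1]

print(values.tolist())  # ['19.03', '18.87']
print(units.tolist())   # ['trillion', 'trillion']
```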

So, if we split into two-element lists with split(' ', 1), we can treat the object returned by an additional .str as an iterable and unpack it:

a, b = df['GDP'].str.split(' ', 1).str

a

Date
Mar 31, 2017    19.03
Dec 31, 2016    18.87
Name: GDP, dtype: object

And

b

Date
Mar 31, 2017    trillion
Dec 31, 2016    trillion
Name: GDP, dtype: object

OK, we can shortcut the creation of the two new columns by leveraging this iterable unpacking:

df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1).str
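
One caveat worth hedging: to my knowledge, newer pandas releases (2.0 and later) no longer allow iterating over the .str accessor, so the unpacking above may raise a TypeError there. A minimal sketch of the same two-column assignment using explicit .str[0] and .str[1] indexing, which should behave the same way across versions, could look like this (the variable name parts is my own):

```python
import pandas as pd

# Rebuild the question's frame (values copied from the post).
df = pd.DataFrame(
    {"GDP": ["19.03 trillion", "18.87 trillion"]},
    index=pd.Index(["Mar 31, 2017", "Dec 31, 2016"], name="Date"),
)

# Compute the lists once, before any column is overwritten.
parts = df["GDP"].str.split(" ", n=1)

# Tuple assignment: the right-hand side is built first, then each
# Series is assigned to its target column in order.
df["GDP"], df["Units"] = parts.str[0], parts.str[1]

print(df)
```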

Alternatively, we can pass expand=True to expand the lists into the columns of a new DataFrame:

df['GDP'].str.split(' ', 1, expand=True)

                  0         1
Date                         
Mar 31, 2017  19.03  trillion
Dec 31, 2016  18.87  trillion

Now we can assign a dataframe to new columns of another dataframe like so

df[['GDP', 'Units']] = df['GDP'].str.split(' ', 1, expand=True)
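
For reference, here is a runnable sketch of that assignment on a rebuilt copy of the question's frame. It assumes n=1 is passed by keyword, and split_df is a name I introduced for illustration. The point to notice is that the expanded result is a DataFrame whose column labels are 0 and 1, and that assigning it to a list of two column names fills those columns in order.

```python
import pandas as pd

# Rebuild the question's frame (values copied from the post).
df = pd.DataFrame(
    {"GDP": ["19.03 trillion", "18.87 trillion"]},
    index=pd.Index(["Mar 31, 2017", "Dec 31, 2016"], name="Date"),
)

# expand=True returns a DataFrame instead of a Series of lists.
split_df = df["GDP"].str.split(" ", n=1, expand=True)
print(list(split_df.columns))  # [0, 1]

# Assigning a two-column DataFrame to a list of two column names
# writes column 0 into 'GDP' and column 1 into 'Units'.
df[["GDP", "Units"]] = split_df

print(df)
```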

However, when we do

df['GDP'], df['Units'] = df['GDP'].str.split(' ', 1, expand=True)

The return value of df['GDP'].str.split(' ', 1, expand=True) is a DataFrame, and unpacking a DataFrame iterates over its column labels, not over the values in its columns. As you can see just above, those labels are 0 and 1. So in this case the scalar 0 is assigned to df['GDP'] and the scalar 1 is assigned to df['Units'], and each scalar is broadcast to every row.
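
The behaviour can be checked directly with a small sketch. It rebuilds the question's frame, shows that iterating the expanded DataFrame yields its column labels, and then repeats script 3's assignment; split_df, first, and second are illustrative names of my own.

```python
import pandas as pd

# Rebuild the question's frame (values copied from the post).
df = pd.DataFrame(
    {"GDP": ["19.03 trillion", "18.87 trillion"]},
    index=pd.Index(["Mar 31, 2017", "Dec 31, 2016"], name="Date"),
)

split_df = df["GDP"].str.split(" ", n=1, expand=True)

# Iterating (and therefore unpacking) a DataFrame yields column labels.
first, second = split_df
print(first, second)  # 0 1

# Same as script 3: each target column receives a scalar label,
# which pandas broadcasts down every row.
df["GDP"], df["Units"] = split_df

print(df["GDP"].tolist())    # [0, 0]
print(df["Units"].tolist())  # [1, 1]
```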
