Split string to columns using text as column headers and column values in pandas

Question

I have a df that has 1 column where each row contains a string. It looks like this:

df
          data
in 9.14  out 9.66  type 0.0
in 9.67  out 9.69  type 0.0
in 9.70  out 10.66 type 0.0
in 10.67 out 11.34 type 2.0
in 11.35 out 12.11 type 2.0

I want to split the text of this column into multiple columns. I want to use the words [in, out, type] as column headers, and the values following each word as the row values. The result will have 3 columns labeled in, out and type and will look like this:

        df
        
         in    out   type
        9.14   9.66   0.0
        9.67   9.69   0.0
        9.70   10.66  0.0
        10.67  11.34  2.0
        11.35  12.11  2.0

Thanks!

user3483203 · Accepted Answer · 2019-08-29 17:12:03Z

1

If you know in advance what the words will be, and can also guarantee that there won't be any bad data, this is a simple str.extract problem: you can write one regular expression that captures each value in a named group, and the group names become the column names of the resulting DataFrame in a single pass. That regular expression for your sample data is shown in Option 2.

However, for the sake of demonstration, it is better to assume that you might have bad data, and that you might not know in advance what your column names are. In that case, you can use str.extractall and some unstacking.

Option 1
extractall + set_index + unstack

generic_regex = r'([a-zA-Z]+)[^0-9]+([0-9\.]+)'

df['data'].str.extractall(generic_regex).set_index(0, append=True)[1].unstack([0, 1])

0         in    out type
match      0      1    2
0       9.14   9.66  0.0
1       9.67   9.69  0.0
2       9.70  10.66  0.0
3      10.67  11.34  2.0
4      11.35  12.11  2.0
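
For readers who want to run Option 1 end to end, here is a minimal sketch that rebuilds the sample frame and performs the same reshape in named steps. It is a variation on the one-liner above, not a replacement for it: the column labels "name" and "value" and the variables pairs and wide are hypothetical names introduced here, and the final conversion to float assumes every captured value is numeric.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"data": [
    "in 9.14  out 9.66  type 0.0",
    "in 9.67  out 9.69  type 0.0",
    "in 9.70  out 10.66 type 0.0",
    "in 10.67 out 11.34 type 2.0",
    "in 11.35 out 12.11 type 2.0",
]})

generic_regex = r'([a-zA-Z]+)[^0-9]+([0-9\.]+)'

# extractall returns one row per (original row, match number);
# the two capture groups become columns 0 and 1.
pairs = df["data"].str.extractall(generic_regex)
pairs.columns = ["name", "value"]  # hypothetical labels, for readability

# Drop the match counter, move the word into the index, then pivot it
# into columns so that each original row becomes one output row.
wide = (
    pairs.reset_index(level="match", drop=True)
         .set_index("name", append=True)["value"]
         .unstack("name")
         .astype(float)  # assumes all captured values are numeric
)
wide.columns.name = None
print(wide)
```

Labelling the columns before reshaping avoids having to track whether a bare 0 refers to a column label or to an index level position, and dropping the match level yields flat column headers instead of a two-level header.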

Option 2
Define an explicit regex and use extract

rgx = r'in\s+(?P<in>[^\s]+)\s+out\s+(?P<out>[^\s]+)\s+type\s+(?P<type>[^\s]+)'

df['data'].str.extract(rgx)

      in    out type
0   9.14   9.66  0.0
1   9.67   9.69  0.0
2   9.70  10.66  0.0
3  10.67  11.34  2.0
4  11.35  12.11  2.0
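
Because str.extract returns NaN for every group when a row does not match the pattern, a result full of NaN usually means the real data differs slightly from the sample. The sketch below shows one way to spot such rows. The third row is a made-up malformed example (not from the question), and the names extracted, unmatched, and result are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"data": [
    "in 9.14  out 9.66  type 0.0",
    "in 10.67 out 11.34 type 2.0",
    "in=11.35 out=12.11 type=2.0",  # hypothetical malformed row: '=' instead of spaces
]})

rgx = r'in\s+(?P<in>[^\s]+)\s+out\s+(?P<out>[^\s]+)\s+type\s+(?P<type>[^\s]+)'

# Named groups become the column names; rows that do not match are all NaN.
extracted = df["data"].str.extract(rgx)

# Flag the rows where the pattern failed, so they can be inspected.
unmatched = extracted.isna().all(axis=1)
print(df.loc[unmatched, "data"])

# Keep the rows that matched and convert the captured strings to floats.
result = extracted[~unmatched].astype(float)
print(result)
```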

edited Aug 29, 2019 at 17:12

answered Aug 29, 2019 at 16:51


4 Comments

connor449 Over a year ago

Thanks, user3483203. Unfortunately, the code splits the columns appropriately, but all of the data turns into NaN. Please advise.

connor449 Over a year ago

user3483203, thanks for the more detailed response. I understand better what is going on now, but the generic approach doesn't yield any result. The code runs fine, but the output is empty.

user3483203 Over a year ago

Is the sample dataframe that you shared similar to your actual dataframe?

connor449 Over a year ago

Got it, thanks. The generic approach worked. My actual dataframe had a slight difference that once I fixed, everything worked. Thanks so much!

Andy L. · Accepted Answer · 2019-08-29 22:37:54Z

0

If your data is separated evenly between name and value by whitespace, as in your sample, you can use split and the str accessor with a stride to construct the desired output:

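import pandas as pd

df1 = df['data'].str.split()  # each row becomes [name, value, name, value, ...]
df_out = pd.DataFrame(df1.str[1::2].tolist(), columns=df1[0][0::2])  # odd positions are values, even positions are names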

Out[1097]:
      in    out type
0   9.14   9.66  0.0
1   9.67   9.69  0.0
2   9.70  10.66  0.0
3  10.67  11.34  2.0
4  11.35  12.11  2.0
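
A self-contained sketch of the same split-and-stride idea follows, with two additions that are not part of the answer above: a check that every row lists the same names in the same order, and a conversion of the values to float. The variables tokens, names, values, and same_layout are hypothetical, and the float conversion assumes all values are numeric.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"data": [
    "in 9.14  out 9.66  type 0.0",
    "in 9.67  out 9.69  type 0.0",
    "in 9.70  out 10.66 type 0.0",
    "in 10.67 out 11.34 type 2.0",
    "in 11.35 out 12.11 type 2.0",
]})

# Split on runs of whitespace: each row becomes [name, value, name, value, ...].
tokens = df["data"].str.split()

# Even positions hold the names; take them from the first row.
names = tokens.iloc[0][0::2]

# The stride approach only works if every row has the same layout.
same_layout = tokens.str[0::2].apply(lambda row: row == names).all()

# Odd positions hold the values.
values = tokens.str[1::2].tolist()
df_out = pd.DataFrame(values, columns=names, index=df.index).astype(float)
print(df_out)
```

If same_layout is False, the rows do not share the same layout and one of the regex-based options is likely the safer choice.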

answered Aug 29, 2019 at 22:37
