How to split variable sized string-based column into multiple columns in Pandas DataFrame?

Question

I have a pandas DataFrame which is of the form :

A      B       C     D
A1     6       7.5   NaN
A1     4       23.8  <D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>
A2     7       11.9  <D1 2.0 7.5 10 2, D3 7.5 4.2 13.5 4> 
A3    11       0.8   <D2 2.0 7.5 10 2, D3 7.5 4.2 13.5 4, D4 2.0 7.5 10 2, D5 7.5 4.2 13.5 4>

The column D is a raw-string column with multiple categories in each entry. The value of entry is calculated by dividing the last two values for each category. For example, in 2nd row :

D1 = 12/4 = 3
D2 = 3.5/1 = 3.5

I need to split column D based on it's categories and join them to my DataFrame. The problem is the column is dynamic and can have nearly 35-40 categories within a single entry. For now, all I'm doing is a Brute Force Approach by iterating all rows, which is very slow for large datasets. Can someone please help me?

EXPECTED OUTCOME

A      B       C     D1  D2  D3  D4  D5
A1     6       7.5   NaN NaN NaN NaN NaN
A1     4       23.8  3.0 3.5 NaN NaN NaN
A2     7       11.9  5.0 NaN 3.4 NaN NaN 
A3    11       0.8   NaN 5.0 3.4 5.0 3.4

If you are looking for a function that does this and creates dynamic columns based on your data, i am afraid there isnt a function that you could use. You could try to use df.apply() to apply a function created by you on a column or a row, and df.assign() to add columns — Loukik
– Loukik, Commented Jul 29, 2020 at 6:30
@Loukik Yeah, I know. I tried df.apply(), but df.apply() seems slow for large datasets with huge number of row entries. The division could be attained by vectorized operations, but to make the split efficiently is the issue. — saran sky
– saran sky, Commented Jul 29, 2020 at 9:37

Shubham Sharma · Accepted Answer · 2020-07-29 06:35:11Z

3

Use:

d = df['D'].str.extractall(r'(D\d+).*?([\d.]+)\s([\d.]+)(?:,|\>)')
d = d.droplevel(1).set_index(0, append=True).astype(float)
d = df.join(d[1].div(d[2]).round(1).unstack()).drop('D', 1)

Details:

Use Series.str.extractall to extract all the capture groups from the column D as specified by the regex pattern. You can test the regex pattern here.

print(d)
          0     1  2 # --> capture groups
  match             
1 0      D1    12  4
  1      D2   3.5  1
2 0      D1    10  2
  1      D3  13.5  4
3 0      D2    10  2
  1      D3  13.5  4
  2      D4    10  2
  3      D5  13.5  4

Use DataFrame.droplevel + set_index with optional parameter append=True to drop the unused level and append a new index to datafarme.

print(d)
         1    2
  0            
1 D1  12.0  4.0
  D2   3.5  1.0
2 D1  10.0  2.0
  D3  13.5  4.0
3 D2  10.0  2.0
  D3  13.5  4.0
  D4  10.0  2.0
  D5  13.5  4.0

Use Series.div to divide column 1 by 2 and use Series.round to round the values then use Series.unstack to reshape the dataframe, then using DataFrame.join join the new dataframe with df

print(d)
    A   B     C   D1   D2   D3   D4   D5
0  A1   6   7.5  NaN  NaN  NaN  NaN  NaN
1  A1   4  23.8  3.0  3.5  NaN  NaN  NaN
2  A2   7  11.9  5.0  NaN  3.4  NaN  NaN
3  A3  11   0.8  NaN  5.0  3.4  5.0  3.4

edited Jul 29, 2020 at 6:35

answered Jul 29, 2020 at 6:29

Shubham Sharma

71.8k6 gold badges26 silver badges58 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

saran sky Over a year ago

Thanks so much for the quick reply. I'm able to understand everything except the REGEX part. How was the regex able to filter out only the last 2 values? How were the middle nos.eliminated?

Shubham Sharma Over a year ago

Hi, saran, Here we are using .*? in regex pattern which matches any character zero to more times but as few times as possible. i have provided the link to the explanation for regex, please check it.

saran sky Over a year ago

Thanks Shubham, was able to understand how the REGEX worked as well. But, still is there any more way to optimize this further? I ran this on a dataset containing 3.5 lakh rows, and ``` str.extractall ``` alone takes nearly 30-32 secs. In my previous Brute force approach, I split the data and iterated in parallel, and that is taking 26-27 secs.

saran sky Over a year ago

Thanks @Shubham Sharma. Implementing multiprocessing boosted the solutions' performance.

Collectives™ on Stack Overflow

How to split variable sized string-based column into multiple columns in Pandas DataFrame?

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related