5

I have a pandas DataFrame which is of the form :

A      B       C     D
A1     6       7.5   NaN
A1     4       23.8  <D1 0.0 6.5 12 4, D2 1.0 4 3.5 1>
A2     7       11.9  <D1 2.0 7.5 10 2, D3 7.5 4.2 13.5 4> 
A3    11       0.8   <D2 2.0 7.5 10 2, D3 7.5 4.2 13.5 4, D4 2.0 7.5 10 2, D5 7.5 4.2 13.5 4>

The column D is a raw-string column with multiple categories in each entry. The value of entry is calculated by dividing the last two values for each category. For example, in 2nd row :

D1 = 12/4 = 3
D2 = 3.5/1 = 3.5

I need to split column D based on it's categories and join them to my DataFrame. The problem is the column is dynamic and can have nearly 35-40 categories within a single entry. For now, all I'm doing is a Brute Force Approach by iterating all rows, which is very slow for large datasets. Can someone please help me?

EXPECTED OUTCOME

A      B       C     D1  D2  D3  D4  D5
A1     6       7.5   NaN NaN NaN NaN NaN
A1     4       23.8  3.0 3.5 NaN NaN NaN
A2     7       11.9  5.0 NaN 3.4 NaN NaN 
A3    11       0.8   NaN 5.0 3.4 5.0 3.4
2
  • If you are looking for a function that does this and creates dynamic columns based on your data, i am afraid there isnt a function that you could use. You could try to use df.apply() to apply a function created by you on a column or a row, and df.assign() to add columns Commented Jul 29, 2020 at 6:30
  • @Loukik Yeah, I know. I tried df.apply(), but df.apply() seems slow for large datasets with huge number of row entries. The division could be attained by vectorized operations, but to make the split efficiently is the issue. Commented Jul 29, 2020 at 9:37

1 Answer 1

3

Use:

d = df['D'].str.extractall(r'(D\d+).*?([\d.]+)\s([\d.]+)(?:,|\>)')
d = d.droplevel(1).set_index(0, append=True).astype(float)
d = df.join(d[1].div(d[2]).round(1).unstack()).drop('D', 1)

Details:

Use Series.str.extractall to extract all the capture groups from the column D as specified by the regex pattern. You can test the regex pattern here.

print(d)
          0     1  2 # --> capture groups
  match             
1 0      D1    12  4
  1      D2   3.5  1
2 0      D1    10  2
  1      D3  13.5  4
3 0      D2    10  2
  1      D3  13.5  4
  2      D4    10  2
  3      D5  13.5  4

Use DataFrame.droplevel + set_index with optional parameter append=True to drop the unused level and append a new index to datafarme.

print(d)
         1    2
  0            
1 D1  12.0  4.0
  D2   3.5  1.0
2 D1  10.0  2.0
  D3  13.5  4.0
3 D2  10.0  2.0
  D3  13.5  4.0
  D4  10.0  2.0
  D5  13.5  4.0

Use Series.div to divide column 1 by 2 and use Series.round to round the values then use Series.unstack to reshape the dataframe, then using DataFrame.join join the new dataframe with df

print(d)
    A   B     C   D1   D2   D3   D4   D5
0  A1   6   7.5  NaN  NaN  NaN  NaN  NaN
1  A1   4  23.8  3.0  3.5  NaN  NaN  NaN
2  A2   7  11.9  5.0  NaN  3.4  NaN  NaN
3  A3  11   0.8  NaN  5.0  3.4  5.0  3.4
Sign up to request clarification or add additional context in comments.

4 Comments

Thanks so much for the quick reply. I'm able to understand everything except the REGEX part. How was the regex able to filter out only the last 2 values? How were the middle nos.eliminated?
Hi, saran, Here we are using .*? in regex pattern which matches any character zero to more times but as few times as possible. i have provided the link to the explanation for regex, please check it.
Thanks Shubham, was able to understand how the REGEX worked as well. But, still is there any more way to optimize this further? I ran this on a dataset containing 3.5 lakh rows, and ``` str.extractall ``` alone takes nearly 30-32 secs. In my previous Brute force approach, I split the data and iterated in parallel, and that is taking 26-27 secs.
Thanks @Shubham Sharma. Implementing multiprocessing boosted the solutions' performance.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.