Sum value in specific combinations of rows

Question

I have the following dataframe:

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Name' : ['Jake', 'Nate', '', 'Alex', '', 'Max', 'Nate', 'Jake'],
                    'Color' : ['', 'red;blue', 'blue;pink', 'green;blue;red', '', '', 'blue', 'red;yellow'],
                    'Value_1' : [1211233.419, 4007489.726, 953474.6894, np.nan, 1761987.704, 222600361, 404419.2243, 606066.067],
                    'Value_2' : [np.nan, 1509907.457, 4792269.911, 43486.59312, np.nan, np.nan, 2066645.251, 60988660.37],
                    'Value_3' : [1175299.998, np.nan, 1888559.459, np.nan, 444689.0177, 405513.0572, 343704.0269, 2948494.383]})
---
   Name           Color       Value_1       Value_2       Value_3
0  Jake                  1.211233e+06           NaN  1.175300e+06
1  Nate        red;blue  4.007490e+06  1.509907e+06           NaN
2             blue;pink  9.534747e+05  4.792270e+06  1.888559e+06
3  Alex  green;blue;red           NaN  4.348659e+04           NaN
4                        1.761988e+06           NaN  4.446890e+05
5   Max                  2.226004e+08           NaN  4.055131e+05
6  Nate            blue  4.044192e+05  2.066645e+06  3.437040e+05
7  Jake      red;yellow  6.060661e+05  6.098866e+07  2.948494e+06

I need two things:

1) In the first case, I need to sum all the values (Value_1, Value_2, Value_3) for rows that share the same name, giving for example:

   Name       Value_1       Value_2       Value_3
0  Jake  1.817299e+06  6.098866e+07  4.123794e+06
1  Nate  4.411909e+06  3.576553e+06  3.437040e+05
2  Alex           NaN  4.348659e+04           NaN
3   Max  2.226004e+08           NaN  4.055131e+05

2) I need the same thing, but grouped by the name plus each color obtained by splitting the Color column (only for rows that have both a name and at least one color):

   Name           Color       Value_1       Value_2       Value_3
0  Alex           green           NaN  4.348659e+04           NaN
1  Alex            blue           NaN  4.348659e+04           NaN
2  Alex             red           NaN  4.348659e+04           NaN
3  Jake             red  6.060661e+05  6.098866e+07  2.948494e+06
4  Jake          yellow  6.060661e+05  6.098866e+07  2.948494e+06
5  Nate             red  4.007490e+06  1.509907e+06           NaN
6  Nate            blue  4.411909e+06  3.576553e+06  3.437040e+05

(Note that in this case the only Name/Color combination that occurs in more than one row is Nate-blue.)

[Edit]

I apologize, but there is a further case I had not considered and cannot resolve. For point 2, the same color can appear several times for the same name within one row, separated by semicolons, as in this example:

Name       color   Value_1   Value_2   Value_3
Max       red;red     1         1         1
Jake    b;b;b;y;y     1         1         1
Max           red     3         3         3

I get something like:

Name       color   Value_1   Value_2   Value_3
 Max       red         5         5         5
 Jake        b         3         3         3
 Jake        y         2         2         2

That is because each value is added once for every occurrence of a color associated with that name. But I would like a color repeated within the same row for the same name to be counted only once:

Name       color   Value_1   Value_2   Value_3
Max       red         4         4         4
Jake        b         1         1         1
Jake        y         1         1         1

E.g. for the row with index=4, does it mean it is data for Alex;green;blue;red? From the rows before?
– jezrael, Commented Sep 7, 2022 at 10:02
For your first question, yes it is; for your second, they are null values that I will have to discard, as you did below (sorry for the delay).
– Mario, Commented Sep 7, 2022 at 11:19

jezrael · Accepted Answer · 2022-10-13 09:32:36Z

1

First, replace the empty strings in the first two columns with missing values:

df1[['Name','Color']] = df1[['Name','Color']].replace('', np.nan)

Then aggregate with sum, passing min_count=1 so that a group with no non-missing values gives NaN instead of 0:

df2 = df1.groupby('Name', as_index=False).sum(min_count=1)
print (df2)
   Name       Value_1       Value_2       Value_3
0  Alex           NaN  4.348659e+04           NaN
1  Jake  1.817299e+06  6.098866e+07  4.123794e+06
2   Max  2.226004e+08           NaN  4.055131e+05
3  Nate  4.411909e+06  3.576553e+06  3.437040e+05
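
To make the effect of min_count=1 concrete, here is a minimal sketch on a made-up frame. The names `toy`, `plain`, and `strict` are illustrative only and do not come from the question. With the default (min_count=0) a group whose values are all missing sums to 0, while min_count=1 keeps it as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "a" has only missing values, group "b" has one real value.
toy = pd.DataFrame({"Name": ["a", "a", "b", "b"],
                    "Value": [np.nan, np.nan, 1.5, np.nan]})

# Default behaviour (min_count=0): an all-NaN group sums to 0.0.
plain = toy.groupby("Name", as_index=False).sum()

# min_count=1: at least one non-missing value is required, otherwise the sum is NaN.
strict = toy.groupby("Name", as_index=False).sum(min_count=1)

print(plain)
print(strict)
```

This is why Alex keeps NaN in Value_1 and Value_3 in the output above rather than showing 0.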

For the second output, first use Series.str.split with DataFrame.explode, then aggregate with sum:

df3 = (df1.assign(Color=df1['Color'].str.split(';'))
          .explode('Color')
          .groupby(['Name', 'Color'], as_index=False)
          .sum(min_count=1))
print (df3)
   Name   Color       Value_1       Value_2       Value_3
0  Alex    blue           NaN  4.348659e+04           NaN
1  Alex   green           NaN  4.348659e+04           NaN
2  Alex     red           NaN  4.348659e+04           NaN
3  Jake     red  6.060661e+05  6.098866e+07  2.948494e+06
4  Jake  yellow  6.060661e+05  6.098866e+07  2.948494e+06
5  Nate    blue  4.411909e+06  3.576553e+06  3.437040e+05
6  Nate     red  4.007490e+06  1.509907e+06           NaN
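
It may help to look at the intermediate result of the split and explode steps. The following is a small sketch on a hypothetical three-row frame (`small`, `exploded`, and `result` are illustrative names), assuming the empty strings have already been replaced by NaN as in the first step.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in the same shape as df1, after '' was replaced by NaN.
small = pd.DataFrame({"Name": ["Nate", np.nan, "Nate"],
                      "Color": ["red;blue", "blue;pink", "blue"],
                      "Value_1": [1.0, 10.0, 100.0]})

# str.split turns each Color string into a list; explode gives every list item
# its own row, repeating the other columns (and the original index) for each item.
exploded = small.assign(Color=small["Color"].str.split(";")).explode("Color")
print(exploded)

# groupby drops NaN keys by default (dropna=True), so the row without a name disappears.
result = exploded.groupby(["Name", "Color"], as_index=False).sum(min_count=1)
print(result)
```

Because rows with a missing Name or Color are dropped by the groupby itself, no separate filtering step is needed for the "at least one name and one color" condition.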

EDIT: For the example from the question's edit (where the column is named color), you can remove rows that are complete duplicates with DataFrame.drop_duplicates:

df3 = (df1.assign(color=df1['color'].str.split(';'))
          .explode('color')
          .drop_duplicates()
          .groupby(['Name', 'color'], as_index=False, sort=False)
          .sum(min_count=1)
          )
print (df3)
   Name color  Value_1  Value_2  Value_3
0   Max   red        4        4        4
1  Jake     b        1        1        1
2  Jake     y        1        1        1
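
One caveat: drop_duplicates compares whole rows after exploding, so two different source rows that happen to share the same name, color, and values would also be collapsed into one. If that is not wanted, a possible alternative is to remove repeated colors inside each row before exploding. The sketch below assumes every color entry is a string (no missing values), and the last row of the sample data is a hypothetical addition to show the difference.

```python
import pandas as pd

# Data shaped like the question's edit, plus one hypothetical extra row (the last one)
# that repeats "Max / red / 3".
df = pd.DataFrame({"Name": ["Max", "Jake", "Max", "Max"],
                   "color": ["red;red", "b;b;b;y;y", "red", "red"],
                   "Value_1": [1, 1, 3, 3]})

# Remove repeated colors within each row; dict.fromkeys keeps first-seen order.
# Assumes every entry is a string; NaN entries would need to be handled first.
unique_colors = df["color"].str.split(";").apply(lambda parts: list(dict.fromkeys(parts)))

out = (df.assign(color=unique_colors)
         .explode("color")
         .groupby(["Name", "color"], as_index=False, sort=False)
         .sum(min_count=1))
print(out)
```

Here both "Max / red / 3" rows are counted, whereas drop_duplicates on the exploded frame would keep only one of them.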

edited Oct 13, 2022 at 9:32

answered Sep 7, 2022 at 9:56

jezrael


6 Comments

jezrael Over a year ago

@Mario - df1.assign(Color=df1['Color'].str.split(';')) means: take the Color column, split it, and assign the split values back to the Color column. That makes it possible to use .explode('Color'). If you need a different column, use df1.assign(new=df1['Color'].str.split(';')).explode('new')

Mario Over a year ago

But in this case I would have to put new in the "group by" and this would change the name of the column. I can change the column name later but I was wondering if there is a better way

jezrael Over a year ago

@Mario - hmm, the question is: why do you need to change Color to another name?

jezrael Over a year ago

@Mario - agreed, then you need:

(df1.assign(new=df1['Color'].str.split(';'))
    .explode('new')
    .groupby(['Name', 'new'], as_index=False)
    .sum(min_count=1))

jezrael Over a year ago

@Mario - answer was edited.


mozway · Accepted Answer · 2022-09-07 09:56:50Z

1

You can use:

(df1.assign(Color=df1['Color'].str.split(';'))
    .explode('Color')
    .groupby(['Name', 'Color'], as_index=False)
    .sum()
    .replace('', pd.NA).dropna()
)

output:

    Name   Color       Value_1       Value_2       Value_3
3   Alex    blue  0.000000e+00  4.348659e+04  0.000000e+00
4   Alex   green  0.000000e+00  4.348659e+04  0.000000e+00
5   Alex     red  0.000000e+00  4.348659e+04  0.000000e+00
7   Jake     red  6.060661e+05  6.098866e+07  2.948494e+06
8   Jake  yellow  6.060661e+05  6.098866e+07  2.948494e+06
10  Nate    blue  4.411909e+06  3.576553e+06  3.437040e+05
11  Nate     red  4.007490e+06  1.509907e+06  0.000000e+00

answered Sep 7, 2022 at 9:56

mozway


R. Baraiya · Accepted Answer · 2022-09-07 09:55:58Z

0

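# Split each Color string into a list of colors (assumes every entry is a string)
df1['Color'] = df1['Color'].apply(lambda x: x.split(';'))
# explode returns a new DataFrame with one row per color, so assign the result
df1 = df1.explode('Color')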

answered Sep 7, 2022 at 9:55

R. Baraiya
