
I have the following dataframe:

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Name' : ['Jake', 'Nate', '', 'Alex', '', 'Max', 'Nate', 'Jake'],
                    'Color' : ['', 'red;blue', 'blue;pink', 'green;blue;red', '', '', 'blue', 'red;yellow'],
                    'Value_1' : [1211233.419, 4007489.726, 953474.6894, np.nan, 1761987.704, 222600361, 404419.2243, 606066.067],
                    'Value_2' : [np.nan, 1509907.457, 4792269.911, 43486.59312, np.nan, np.nan, 2066645.251, 60988660.37],
                    'Value_3' : [1175299.998, np.nan, 1888559.459, np.nan, 444689.0177, 405513.0572, 343704.0269, 2948494.383]})
---
   Name           Color       Value_1       Value_2       Value_3
0  Jake                  1.211233e+06           NaN  1.175300e+06
1  Nate        red;blue  4.007490e+06  1.509907e+06           NaN
2             blue;pink  9.534747e+05  4.792270e+06  1.888559e+06
3  Alex  green;blue;red           NaN  4.348659e+04           NaN
4                        1.761988e+06           NaN  4.446890e+05
5   Max                  2.226004e+08           NaN  4.055131e+05
6  Nate            blue  4.044192e+05  2.066645e+06  3.437040e+05
7  Jake      red;yellow  6.060661e+05  6.098866e+07  2.948494e+06

I need two things:

1) First, I need to sum the values (Value_1, Value_2, Value_3) across all rows that share the same name, to get for example:

   Name       Value_1       Value_2       Value_3
0  Jake  1.817299e+06  6.098866e+07  4.123794e+06
1  Nate  4.411909e+06  3.576553e+06  3.437040e+05
2  Alex           NaN  4.348659e+04           NaN
3   Max  2.226004e+08           NaN  4.055131e+05

2) I need the same sum, but grouped by the name plus each individual color obtained by splitting the Color column (only for rows that have both a name and at least one color):

   Name           Color       Value_1       Value_2       Value_3
0  Alex           green           NaN  4.348659e+04           NaN
1  Alex            blue           NaN  4.348659e+04           NaN
3  Alex             red           NaN  4.348659e+04           NaN
4  Jake             red  6.060661e+05  6.098866e+07  2.948494e+06
5  Jake          yellow  6.060661e+05  6.098866e+07  2.948494e+06
6  Nate             red  4.007490e+06  1.509907e+06           NaN
7  Nate            blue  4.411909e+06  3.576553e+06  3.437040e+05

(Note that in this case the only name-color pair that appears in more than one row is Nate-blue.)

[Edit]

I apologize, but I had not considered a further case that I am unable to resolve. For point 2, the same color can appear several times for the same name within one semicolon-separated cell, as in this example:

Name       color   Value_1   Value_2   Value_3
Max       red;red     1         1         1
Jake    b;b;b;y;y     1         1         1
Max           red     3         3         3

I currently get something like:

Name       color   Value_1   Value_2   Value_3
 Max       red         5         5         5
 Jake        b         3         3         3
 Jake        y         2         2         2

This happens because each value is added once for every occurrence of the color associated with that name. I would like a color repeated within the same row for the same name to be counted only once:

Name       color   Value_1   Value_2   Value_3
Max       red         4         4         4
Jake        b         1         1         1
Jake        y         1         1         1
  • E.g. does the row with index=4 hold data for Alex;green;blue;red, taken from the rows before? Commented Sep 7, 2022 at 10:02
  • What do the empty strings in the first 2 columns mean? Commented Sep 7, 2022 at 10:03
  • To your first question: yes, it does. To your second: they are null values that I will have to discard, as you did below (sorry for the delay). Commented Sep 7, 2022 at 11:19

3 Answers

1

First replace the empty strings in the first 2 columns with missing values:

df1[['Name','Color']] = df1[['Name','Color']].replace('', np.nan)

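Then aggregate with sum, using min_count=1 so that groups containing only missing values give NaN instead of 0:

# numeric_only=True keeps the string column Color out of the sum (needed in pandas 2.x)
df2 = df1.groupby('Name', as_index=False).sum(numeric_only=True, min_count=1)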
print (df2)
   Name       Value_1       Value_2       Value_3
0  Alex           NaN  4.348659e+04           NaN
1  Jake  1.817299e+06  6.098866e+07  4.123794e+06
2   Max  2.226004e+08           NaN  4.055131e+05
3  Nate  4.411909e+06  3.576553e+06  3.437040e+05
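
To illustrate what min_count changes, here is a minimal sketch on a made-up frame (the name demo and its values are assumptions for illustration, not part of the question's data). With the default min_count=0 a group that holds only NaN sums to 0, while min_count=1 returns NaN for it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Alex has only missing values, Max has one real value.
demo = pd.DataFrame({
    'Name': ['Alex', 'Alex', 'Max', 'Max'],
    'Value_1': [np.nan, np.nan, 1.0, np.nan],
})

# Default behaviour: NaN is skipped, so an all-NaN group sums to 0.0.
default_sum = demo.groupby('Name')['Value_1'].sum()

# min_count=1: at least one non-missing value is required, otherwise NaN.
strict_sum = demo.groupby('Name')['Value_1'].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

This is why Alex shows NaN rather than 0 in Value_1 and Value_3 above.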

For the second output, first use Series.str.split with DataFrame.explode to get one row per color, and then aggregate with sum:

df3 = (df1.assign(Color=df1['Color'].str.split(';'))
          .explode('Color')
          .groupby(['Name', 'Color'], as_index=False)
          .sum(min_count=1))
print (df3)
   Name   Color       Value_1       Value_2       Value_3
0  Alex    blue           NaN  4.348659e+04           NaN
1  Alex   green           NaN  4.348659e+04           NaN
2  Alex     red           NaN  4.348659e+04           NaN
3  Jake     red  6.060661e+05  6.098866e+07  2.948494e+06
4  Jake  yellow  6.060661e+05  6.098866e+07  2.948494e+06
5  Nate    blue  4.411909e+06  3.576553e+06  3.437040e+05
6  Nate     red  4.007490e+06  1.509907e+06           NaN
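
If the intermediate step is unclear, the following sketch shows it on a small made-up frame (demo and its numbers are illustrative assumptions). Each color in a cell gets its own row with a copy of that row's values, and the groupby then adds up the rows that share a name-color pair:

```python
import pandas as pd

# Hypothetical toy data: Nate has "blue" in two different rows.
demo = pd.DataFrame({
    'Name': ['Nate', 'Nate'],
    'Color': ['red;blue', 'blue'],
    'Value_1': [10.0, 1.0],
})

# Split "red;blue" into ['red', 'blue'], then give each list element its own row.
exploded = demo.assign(Color=demo['Color'].str.split(';')).explode('Color')
print(exploded)

# Rows that share (Name, Color) are summed: Nate/blue receives 10.0 + 1.0.
totals = exploded.groupby(['Name', 'Color'], as_index=False).sum(min_count=1)
print(totals)
```

Note that explode repeats the original index label for every row produced from the same cell.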

EDIT: To count a color repeated within the same row only once, remove the duplicated rows (compared across all columns) after exploding, with DataFrame.drop_duplicates:

df3 = (df1.assign(color=df1['color'].str.split(';'))
          .explode('color')
          .drop_duplicates()
          .groupby(['Name', 'color'], as_index=False, sort=False)
          .sum(min_count=1)
          )
print (df3)
   Name color  Value_1  Value_2  Value_3
0   Max   red        4        4        4
1  Jake     b        1        1        1
2  Jake     y        1        1        1
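
One caveat: drop_duplicates without arguments compares all columns, so two different original rows that happen to have the same name, color and values would also be merged into one. If that matters, a possible variant (a sketch only, using a made-up frame named demo with a single value column) is to keep the original row label and drop duplicates per row:

```python
import pandas as pd

# Hypothetical toy data: rows 2 and 3 are separate rows with identical content.
demo = pd.DataFrame({
    'Name': ['Max', 'Jake', 'Max', 'Max'],
    'color': ['red;red', 'b;b;b;y;y', 'red', 'red'],
    'Value_1': [1, 1, 3, 3],
})

exploded = (demo.assign(color=demo['color'].str.split(';'))
                .explode('color')
                .reset_index())  # original row label is kept in a column named 'index'

# Drop a repeated color only when it comes from the same original row.
per_row = exploded.drop_duplicates(subset=['index', 'color'])
result = per_row.groupby(['Name', 'color'], sort=False)['Value_1'].sum(min_count=1)

# For comparison: dropping duplicates across all columns also merges rows 2 and 3.
all_cols = (exploded.drop(columns='index')
                    .drop_duplicates()
                    .groupby(['Name', 'color'], sort=False)['Value_1']
                    .sum(min_count=1))

print(result)    # Max/red counts 1 + 3 + 3
print(all_cols)  # Max/red counts 1 + 3
```

Which behaviour is wanted depends on whether identical rows in the source data are real repeated observations or accidental duplicates.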

6 Comments

@Mario - df1.assign(Color=df1['Color'].str.split(';')) takes the Color column, splits it, and assigns the split values back to Color, so .explode('Color') can be used. If you need a different column, use df1.assign(new=df1['Color'].str.split(';')).explode('new')
But in that case I would have to put new in the groupby, which would change the name of the column. I can rename the column afterwards, but I was wondering whether there is a better way.
@Mario - hmm, the question is why you need to change Color to another name?
@Mario - agreed, then you need (df1.assign(new=df1['Color'].str.split(';')).explode('new').groupby(['Name', 'new'], as_index=False).sum(min_count=1))
@Mario - answer was edited.
1

You can use:

(df1.assign(Color=df1['Color'].str.split(';'))
    .explode('Color')
    .groupby(['Name', 'Color'], as_index=False)
    .sum()
    .replace('', pd.NA).dropna()
)

output:

    Name   Color       Value_1       Value_2       Value_3
3   Alex    blue  0.000000e+00  4.348659e+04  0.000000e+00
4   Alex   green  0.000000e+00  4.348659e+04  0.000000e+00
5   Alex     red  0.000000e+00  4.348659e+04  0.000000e+00
7   Jake     red  6.060661e+05  6.098866e+07  2.948494e+06
8   Jake  yellow  6.060661e+05  6.098866e+07  2.948494e+06
10  Nate    blue  4.411909e+06  3.576553e+06  3.437040e+05
11  Nate     red  4.007490e+06  1.509907e+06  0.000000e+00


0
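# .str.split also copes with missing values, unlike a plain lambda calling x.split
df1['Color'] = df1['Color'].str.split(';')
# explode returns a new DataFrame, so assign the result
df1 = df1.explode('Color')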
