Adding two pandas dataframes together - Specific Columns

Question

My question is an expansion of the one answered here:

Assuming the same dataframes but with a new column containing strings.

import pandas as pd

df1 = pd.DataFrame([('Dog',1,2),('Cat',3,4),('Rabbit',5,6)], columns=['Animal','a','b'])

df2 = pd.DataFrame([('Dog',100,200),('Cat',300,400),('Rabbit',500,600)], columns=['Animal','a','b']

Using the solution would create this:

df_add = df1.add(df2, fill_value=0)

Out: 

       Animal        a    b
    0  DogDog       101  202
    1  CatCat       303  404
    2  RabbitRabbit 505  606

A potential solution could be just to index the Animal column and then run the .add function and then unindex the animal column again. But is there a more simple way that just adjusts this formula df_add = df1.add(df2, fill_value=0)so that the following solution is given:

Out: 

       Animal     a    b
    0  Dog       101  202
    1  Cat       303  404
    2  Rabbit    505  606

I tried df_add.iloc[:,1:] = df1.iloc[:,1:].add(df2.iloc[:,1:], fill_value=0) and it did not work.

what if you had "Parrot" in place of rabbit in df1? What should be the expected result? — mozway
– mozway, Commented Aug 1, 2024 at 8:30
Setting the keys temporarily as index is the best approach IMO. — mozway
– mozway, Commented Aug 1, 2024 at 8:34
Also, can you have duplicated keys in either input? If yes, what should happen? — mozway
– mozway, Commented Aug 1, 2024 at 8:39

mozway · Accepted Answer · 2024-08-01 08:46:21Z

1

Your question is not fully clear. Pandas performs addition operations after aligning the indexes (index + columns), thus if you want to ensure Dog is added to Dog irrespective of its position, setting the index is the way to go:

key = ['Animal']

out = df1.set_index(key).add(df2.set_index(key), fill_value=0).reset_index()

Output:

   Animal    a    b
0     Dog  101  202
1     Cat  303  404
2  Rabbit  505  606

This will ensure that there is no mismatch. For example:

df1 = pd.DataFrame([('Cat',3,4),('Dog',1,2),('Parrot',8,9)], columns=['Animal','a','b'])
df2 = pd.DataFrame([('Dog',100,200),('Cat',300,400),('Rabbit',800,00)], columns=['Animal','a','b'])
                   
key = ['Animal']
df_add = df1.set_index(key).add(df2.set_index(key), fill_value=0).reset_index()

Output:

   Animal      a      b
0     Cat  303.0  404.0
1     Dog  101.0  202.0
2  Parrot    8.0    9.0
3  Rabbit  500.0  600.0

Now, if your DataFrames are already aligned. I.e. the Animals are in the same order and the DataFrame indices are identical. You could use quick trick to ignore the Animal column: set that of df2 as empty string:

df_add = df1.add(df2.assign(Animal=''))

Output:

   Animal    a    b
0     Dog  101  202
1     Cat  303  404
2  Rabbit  505  606

This is however risky if you're not fully sure if the animals and DataFrame indices are identical.

answered Aug 1, 2024 at 8:46

mozway

267k13 gold badges56 silver badges106 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Dinks123 Over a year ago

Thank you, in my case i am sure the indices are identical. But for safety i will use your 1st suggestion.

Fomalhaut · Accepted Answer · 2024-08-01 09:34:30Z

1

I would recommend you to keep the columns for the key and the aggregation as separate variables so the code would be prettier and more flexible:

key_columns = ['Animal']
sum_columns = ['a', 'b']

After that all you need to do is to concatenate the dataframes and aggregate them by the key columns:

pd.concat([df1, df2], ignore_index=True) \
    .fillna(0) \
    .groupby(key_columns) \
    .agg({col: 'sum' for col in sum_columns}) \
    .reset_index()

In case if you don't want to list all the columns in sum_columns, you can calculate it from all column names in df1.columns and key_columns like this, for example:

key_columns = ['Animal']
sum_columns = list(set(df1.columns) - set(key_columns))

Enjoy!

PS: fillna(0) fills NaN values with 0, reset_index() restores Animal as a column, because it became an index after groupby...agg procedure.

edited Aug 1, 2024 at 9:34

answered Aug 1, 2024 at 8:57

Fomalhaut

9,99111 gold badges59 silver badges111 bronze badges

2 Comments

Dinks123 Over a year ago

Thank you i think this is also a very useful solution, i like it. My only concern is that if you have a scenario where the number of columns is large (>10) writing out all the column names in sum_columns may not be practical.

Fomalhaut Over a year ago

@Dinks123 I edited the answer. You can calculate aggregation names as it mentioned.

Collectives™ on Stack Overflow

Adding two pandas dataframes together - Specific Columns

2 Answers 2

1 Comment

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related