Pandas, find difference between two columns, each having different datatype values

Question

Consider the following input data:

prod	col1	col2
One	hi	hello
One	18.0	19.52
One	2024-02-12 00:00:00	2024-03-07 00:00:00
two	2024-02-12 00:00:00	2024-02-11 00:00:00
two	in-transit	in-stock

I want to find the difference between col1 and col2. Because the datatype differs from row to row, I am having difficulty applying pandas functions. Drawing on my SQL knowledge, I tried the code below, but it didn't work.

logic:

if str then difference = "not same"
if datetime then difference = (col2-col1).days
else difference = col2 - col1

df["difference"] = np.where( df['col2'].apply(lambda x: isinstance(x, str)), "not same", 
                                df["col2"].apply(lambda x: isinstance(x, datetime)), (df['col2'] - df['col1']).dt.days, 
                                df['old_value'] - df['new_value'])

I am not getting the expected output: the datetime difference is still a timedelta.

Expected output:

prod	col1	col2	difference
One	hi	hello	not same
One	18.0	19.52	1.52
One	2024-02-12 00:00:00	2024-03-07 00:00:00	25
two	2024-02-12 00:00:00	2024-02-11 00:00:00	1
two	in-transit	in-stock	not same

Please suggest any other approach.

Please provide a minimal reproducible example of the input as code (e.g. output of df.to_dict('list'))
– mozway, Commented Apr 23, 2024 at 7:18
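
One way to satisfy this request is sketched below. It assumes every value in col1 and col2 is stored as a string, which the question does not confirm; if the columns hold real floats and Timestamps, the literals would differ.

```python
import pandas as pd

# Assumption: all values are plain strings (e.g. read from a CSV/TSV file).
df = pd.DataFrame({
    "prod": ["One", "One", "One", "two", "two"],
    "col1": ["hi", "18.0", "2024-02-12 00:00:00",
             "2024-02-12 00:00:00", "in-transit"],
    "col2": ["hello", "19.52", "2024-03-07 00:00:00",
             "2024-02-11 00:00:00", "in-stock"],
})

# The dict form requested in the comment; it can be pasted back
# into pd.DataFrame(...) to rebuild the same frame.
as_dict = df.to_dict("list")
print(as_dict)
```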

mozway · Accepted Answer · 2024-04-23 07:31:50Z

I think the most reliable approach would be to convert the two columns with to_numeric/to_datetime and compute the differences/comparisons in the desired order:

import numpy as np
import pandas as pd

cols = ['col1', 'col2']

tmp_num = df[cols].apply(pd.to_numeric, errors='coerce')
tmp_date = df[cols].apply(pd.to_datetime, errors='coerce')

df['difference'] = (
 tmp_num[cols[1]].sub(tmp_num[cols[0]]).abs()
 .fillna(tmp_date[cols[1]].sub(tmp_date[cols[0]]).dt.days.abs())
 .fillna(pd.Series(np.where(df[cols[0]].ne(df[cols[1]]), 'not same', np.nan),
                   index=df.index))
)

Variant:

c1, c2  = 'col1', 'col2'

df['difference'] = (
 pd.to_numeric(df[c1], errors='coerce')
   .sub(pd.to_numeric(df[c2], errors='coerce'))
   .fillna(pd.to_datetime(df[c1], errors='coerce')
             .sub(pd.to_datetime(df[c2], errors='coerce'))
             .dt.days
          )
   .abs()
   .fillna(df[c1].ne(df[c2]).map({True: 'not same'}))
)

Output:

  prod                 col1                 col2 difference
0  One                   hi                hello   not same
1  One                 18.0                19.52       1.52
2  One  2024-02-12 00:00:00  2024-03-07 00:00:00       24.0
3  two  2024-02-12 00:00:00  2024-02-11 00:00:00        1.0
4  two           in-transit             in-stock   not same
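
For reference, here is a self-contained sketch of the same coerce-then-fill idea on string input. The explicit DATE_FORMAT and the intermediate names (num_diff, day_diff, text_diff) are illustrative assumptions, not part of the answer above; passing a fixed format simply keeps date parsing predictable on a mixed column.

```python
import pandas as pd

# Assumption: every value is stored as a string.
df = pd.DataFrame({
    "prod": ["One", "One", "One", "two", "two"],
    "col1": ["hi", "18.0", "2024-02-12 00:00:00",
             "2024-02-12 00:00:00", "in-transit"],
    "col2": ["hello", "19.52", "2024-03-07 00:00:00",
             "2024-02-11 00:00:00", "in-stock"],
})

# Assumption: dates always follow this layout.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

# Values that cannot be parsed become NaN (numbers) or NaT (dates).
num1 = pd.to_numeric(df["col1"], errors="coerce")
num2 = pd.to_numeric(df["col2"], errors="coerce")
date1 = pd.to_datetime(df["col1"], format=DATE_FORMAT, errors="coerce")
date2 = pd.to_datetime(df["col2"], format=DATE_FORMAT, errors="coerce")

num_diff = (num2 - num1).abs()            # NaN where a value is not numeric
day_diff = (date2 - date1).dt.days.abs()  # NaN where a value is not a date
# Unequal values map to 'not same'; equal values stay NaN.
text_diff = df["col1"].ne(df["col2"]).map({True: "not same"})

# Priority order: numeric difference, then day difference, then the text label.
difference = num_diff.fillna(day_diff).astype(object)
df["difference"] = difference.where(difference.notna(), text_diff)
print(df)
```

Each row is filled by the first conversion that succeeds, so the order of the two fill steps decides which interpretation wins when a value could be read in more than one way.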

Soudipta Dutta · Accepted Answer · 2024-06-04 13:03:16Z

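import pandas as pd

# Row-wise dispatch on the Python type of the stored values: the numeric and
# Timestamp branches only fire if the columns hold real numbers/Timestamps.
def calculate_difference(row):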
    val1, val2 = row['col1'], row['col2']
    if isinstance(val1, (int, float)) and isinstance(val2, (int, float)):
        return abs(val1 - val2)
    elif isinstance(val1, pd.Timestamp) and isinstance(val2, pd.Timestamp):
        return abs((val1 - val2).days)
    elif isinstance(val1, str) and isinstance(val2, str):
        return "not same" if val1 != val2 else "same"
    else:
        return "not comparable"


df['difference'] = df.apply(calculate_difference, axis=1)
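
The row-wise function depends on the types actually stored in the columns. The sketch below repeats the function so it runs on its own and applies it to two hypothetical frames: `typed`, which assumes real floats and Timestamps, and `as_text`, where the same values are strings. Neither frame comes from the question; they only illustrate how the branches are chosen.

```python
import pandas as pd

def calculate_difference(row):
    val1, val2 = row["col1"], row["col2"]
    if isinstance(val1, (int, float)) and isinstance(val2, (int, float)):
        return abs(val1 - val2)
    elif isinstance(val1, pd.Timestamp) and isinstance(val2, pd.Timestamp):
        return abs((val1 - val2).days)
    elif isinstance(val1, str) and isinstance(val2, str):
        return "not same" if val1 != val2 else "same"
    else:
        return "not comparable"

# Hypothetical input holding real Python/pandas objects (object-dtype columns).
typed = pd.DataFrame({
    "prod": ["One", "One", "One", "two", "two"],
    "col1": ["hi", 18.0, pd.Timestamp("2024-02-12"),
             pd.Timestamp("2024-02-12"), "in-transit"],
    "col2": ["hello", 19.52, pd.Timestamp("2024-03-07"),
             pd.Timestamp("2024-02-11"), "in-stock"],
})
typed["difference"] = typed.apply(calculate_difference, axis=1)

# The same values as strings: every row now takes the str branch.
as_text = typed[["prod", "col1", "col2"]].astype(str)
as_text["difference"] = as_text.apply(calculate_difference, axis=1)

print(typed)
print(as_text)
```

If the data arrive as strings, they would need to be converted first (for example with pd.to_numeric and pd.to_datetime) before the numeric and Timestamp branches can apply.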
