
Consider the following input data:

prod  col1                 col2
One   hi                   hello
One   18.0                 19.52
One   2024-02-12 00:00:00  2024-03-07 00:00:00
two   2024-02-12 00:00:00  2024-02-11 00:00:00
two   in-transit           in-stock

I want to find the difference between col1 and col2. Because the data type differs from row to row, I am having difficulty applying pandas functions. Drawing on my SQL knowledge, I tried the code below, but it didn't work.

logic:

  1. if str then difference = "not same"
  2. if datetime then difference = (col2-col1).days
  3. else difference = col2 - col1
df["difference"] = np.where( df['col2'].apply(lambda x: isinstance(x, str)), "not same", 
                                df["col2"].apply(lambda x: isinstance(x, datetime)), (df['col2'] - df['col1']).dt.days, 
                                df['old_value'] - df['new_value'])


I am not getting the expected output: the datetime difference is still a timedelta.

Expected output:

prod  col1                 col2                 difference
One   hi                   hello                not same
One   18.0                 19.52                1.52
One   2024-02-12 00:00:00  2024-03-07 00:00:00  25
two   2024-02-12 00:00:00  2024-02-11 00:00:00  1
two   in-transit           in-stock             not same

Please suggest any other approach.

  • Please provide a minimal reproducible example of the input as code (e.g. output of df.to_dict('list')) Commented Apr 23, 2024 at 7:18

2 Answers


I think the most reliable approach is to convert the two columns with to_numeric/to_datetime and perform the differences/comparisons in the desired order:

import numpy as np
import pandas as pd

cols = ['col1', 'col2']

tmp_num = df[cols].apply(pd.to_numeric, errors='coerce')
tmp_date = df[cols].apply(pd.to_datetime, errors='coerce')

df['difference'] = (
 tmp_num[cols[1]].sub(tmp_num[cols[0]]).abs()
 .fillna(tmp_date[cols[1]].sub(tmp_date[cols[0]]).dt.days.abs())
 .fillna(pd.Series(np.where(df[cols[0]].ne(df[cols[1]]), 'not same', np.nan),
                   index=df.index))
)

Variant:

c1, c2  = 'col1', 'col2'

df['difference'] = (
 pd.to_numeric(df[c1], errors='coerce')
   .sub(pd.to_numeric(df[c2], errors='coerce'))
   .fillna(pd.to_datetime(df[c1], errors='coerce')
             .sub(pd.to_datetime(df[c2], errors='coerce'))
             .dt.days
          )
   .abs()
   .fillna(df[c1].ne(df[c2]).map({True: 'not same'}))
)

Output:

  prod                 col1                 col2 difference
0  One                   hi                hello   not same
1  One                 18.0                19.52       1.52
2  One  2024-02-12 00:00:00  2024-03-07 00:00:00       24.0
3  two  2024-02-12 00:00:00  2024-02-11 00:00:00        1.0
4  two           in-transit             in-stock   not same
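
The same fallback order (numeric first, then dates, then strings) can also be written with explicit boolean masks, which may be easier to read coming from SQL's CASE WHEN. This is only a sketch, not the code from the answer above: it assumes every cell is stored as a string, that timestamps use the layout `%Y-%m-%d %H:%M:%S`, and the names `fmt`, `is_num`, `is_date`, and `diff` are made up for the illustration.

```python
import pandas as pd

# Hypothetical reconstruction of the question's data, with every cell stored as a string.
df = pd.DataFrame({
    "prod": ["One", "One", "One", "two", "two"],
    "col1": ["hi", "18.0", "2024-02-12 00:00:00", "2024-02-12 00:00:00", "in-transit"],
    "col2": ["hello", "19.52", "2024-03-07 00:00:00", "2024-02-11 00:00:00", "in-stock"],
})

cols = ["col1", "col2"]
fmt = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout; keeps parsing deterministic

# Values that cannot be parsed become NaN / NaT instead of raising.
num = df[cols].apply(pd.to_numeric, errors="coerce")
date = df[cols].apply(pd.to_datetime, format=fmt, errors="coerce")

# A row counts as numeric (or date) only if both columns parsed.
is_num = num.notna().all(axis=1)
is_date = date.notna().all(axis=1) & ~is_num

# Start from the string rule, then overwrite the rows that parsed.
diff = pd.Series("not same", index=df.index, dtype=object)

day_diff = (date["col2"] - date["col1"]).dt.days.abs()
diff.loc[is_date] = day_diff[is_date].astype(int).to_numpy()

num_diff = (num["col2"] - num["col1"]).abs()
diff.loc[is_num] = num_diff[is_num].to_numpy()

df["difference"] = diff
print(df)
```

Keeping the result in an object-dtype Series lets integers, floats, and strings coexist in one column. Note that 2024-02-12 to 2024-03-07 is 24 days, since February 2024 has 29 days.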

import pandas as pd


def calculate_difference(row):
    val1, val2 = row['col1'], row['col2']
    # Both numeric: absolute numeric difference
    if isinstance(val1, (int, float)) and isinstance(val2, (int, float)):
        return abs(val1 - val2)
    # Both timestamps: absolute difference in whole days
    elif isinstance(val1, pd.Timestamp) and isinstance(val2, pd.Timestamp):
        return abs((val1 - val2).days)
    # Both strings: only report whether they match
    elif isinstance(val1, str) and isinstance(val2, str):
        return "not same" if val1 != val2 else "same"
    # Mixed types in the same row
    else:
        return "not comparable"


# Apply row by row (axis=1)
df['difference'] = df.apply(calculate_difference, axis=1)
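
Because this function dispatches on `isinstance`, it depends on the cells holding real Python/pandas objects rather than strings. A minimal usage sketch, assuming a hypothetical DataFrame whose object columns hold `str`, `float`, and `pd.Timestamp` values (the data below is reconstructed from the question, not taken from the asker's actual frame):

```python
import pandas as pd


def calculate_difference(row):
    val1, val2 = row["col1"], row["col2"]
    if isinstance(val1, (int, float)) and isinstance(val2, (int, float)):
        return abs(val1 - val2)
    elif isinstance(val1, pd.Timestamp) and isinstance(val2, pd.Timestamp):
        return abs((val1 - val2).days)
    elif isinstance(val1, str) and isinstance(val2, str):
        return "not same" if val1 != val2 else "same"
    else:
        return "not comparable"


# Hypothetical input: object columns holding genuinely mixed Python types.
df = pd.DataFrame({
    "prod": ["One", "One", "One", "two", "two"],
    "col1": ["hi", 18.0, pd.Timestamp("2024-02-12"),
             pd.Timestamp("2024-02-12"), "in-transit"],
    "col2": ["hello", 19.52, pd.Timestamp("2024-03-07"),
             pd.Timestamp("2024-02-11"), "in-stock"],
})

df["difference"] = df.apply(calculate_difference, axis=1)
print(df)
```

If the columns were instead read from a CSV as plain strings, a pair such as "18.0" and "19.52" would fall into the `str` branch and come back as "not same", so the values would need to be converted first.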
