I have this code:

import pandas as pd

test = {"number": ['1555', '1666', '1777', '1888', '1999'],
        "order_amount": ['100.00', '200.00', '-200.00', '300.00', '-150.00'],
        "number_of_refund": ['', '', '1666', '', '1888']}

df = pd.DataFrame(test)

Which returns the following dataframe:

  number order_amount number_of_refund
0   1555       100.00                 
1   1666       200.00                 
2   1777      -200.00             1666
3   1888       300.00                 
4   1999      -150.00             1888    

I want to remove an order and its refund entry if:

  • "number_of_refund" matches a value in the "number" column (the matched order may be missing from the dataframe if the order was placed last month and refunded during the current month)
  • the refund's amount is the exact negative of the matched order's amount (here order 1666 has 200.00 and the refund referencing 1666 has -200.00, so both rows should be removed)

So the result in this case should be:

  number order_amount number_of_refund
0   1555       100.00
3   1888       300.00
4   1999      -150.00             1888

How do I check whether one column's amount appears in another row with the opposite (negated) amount?
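To spell out the rule I'm after, here is a minimal plain-Python sketch (dictionaries instead of a DataFrame, purely illustrative) of the pairing logic:

```python
# Each row of the example dataframe as a plain dict.
orders = [
    {"number": "1555", "order_amount": 100.00, "number_of_refund": ""},
    {"number": "1666", "order_amount": 200.00, "number_of_refund": ""},
    {"number": "1777", "order_amount": -200.00, "number_of_refund": "1666"},
    {"number": "1888", "order_amount": 300.00, "number_of_refund": ""},
    {"number": "1999", "order_amount": -150.00, "number_of_refund": "1888"},
]

# amounts of the original orders, keyed by order number
by_number = {row["number"]: row["order_amount"] for row in orders}

def cancels_out(row):
    """True if this refund row exactly negates the order it refers to."""
    ref = row["number_of_refund"]
    return ref in by_number and row["order_amount"] == -by_number[ref]

# drop every refund row that cancels out, and the order it cancelled
cancelled = {row["number_of_refund"] for row in orders if cancels_out(row)}
kept = [row for row in orders
        if row["number"] not in cancelled and not cancels_out(row)]
print([row["number"] for row in kept])  # ['1555', '1888', '1999']
```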

2 Answers


IIUC, you can use a boolean indexing approach:

# ensure numeric values
df['order_amount'] = pd.to_numeric(df['order_amount'], errors='coerce')

# is the row a refund?
m1 = df['number_of_refund'].ne('')
# map refunded order number -> refund amount
s = df[m1].set_index('number_of_refund')['order_amount']

# for each order, look up its refund amount and flag orders
# whose refund exactly negates the original amount
reimb = df['number'].map(s)
m2 = reimb.eq(-df['order_amount'])
# flag the refund rows that point at one of those orders
m3 = df['number_of_refund'].isin(df.loc[m2, 'number'])

# keep rows flagged by neither mask
df = df[~(m2|m3)]

output:

  number  order_amount number_of_refund
0   1555         100.0                 
3   1888         300.0                 
4   1999        -150.0             1888

4 Comments

Hello, I have a problem with this solution: in my data the number_of_refund field is NULL rather than an empty string when it has no value, so I am using m1 = df['number_of_refund'].notna(), and it shows an error on the s = df[m1].set_index('number_of_refund')['order_amount'] step. How could I fix this?
If you want to replace NaNs with empty strings do df['number_of_refund'] = df['number_of_refund'].fillna('')
I did that but it still shows an error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects. Please wait, there might be a mistake in my field formatting.
Did you have an initially duplicated index? You can drop it if it's not important: df = df.reset_index(drop=True), then fillna
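Putting the comment thread together, a sketch of the suggested preprocessing (the data here is invented to reproduce NaN refunds and a duplicated index; the rest of the solution then works unchanged):

```python
import numpy as np
import pandas as pd

# Hypothetical data with NaN instead of '' and a duplicated index,
# mimicking the situation described in the comments.
df = pd.DataFrame(
    {"number": ["1555", "1666", "1777"],
     "order_amount": ["100.00", "200.00", "-200.00"],
     "number_of_refund": [np.nan, np.nan, "1666"]},
    index=[0, 0, 1],
)

df = df.reset_index(drop=True)                              # drop the duplicated index
df["number_of_refund"] = df["number_of_refund"].fillna("")  # NaN -> ''
df["order_amount"] = pd.to_numeric(df["order_amount"], errors="coerce")

# the original solution, unchanged
m1 = df["number_of_refund"].ne("")
s = df[m1].set_index("number_of_refund")["order_amount"]
reimb = df["number"].map(s)
m2 = reimb.eq(-df["order_amount"])
m3 = df["number_of_refund"].isin(df.loc[m2, "number"])
print(df[~(m2 | m3)]["number"].tolist())  # ['1555']
```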

Let's say I change the refunded amount for 1999 to be -200.00

test = {"number": ['1555','1666','1777', '1888', '1999'],
        "order_amount": ['100.00','200.00','-200.00', '300.00', '-200.00'],
        "number_of_refund": ['','','1666', '', '1888']  }
df = pd.DataFrame(test)
print(df)

  number order_amount number_of_refund
0   1555       100.00                 
1   1666       200.00                 
2   1777      -200.00             1666
3   1888       300.00                 
4   1999      -200.00             1888

Here's another way to do it. I create a unique key by concatenating number_of_refund (filled with the number column where blank) and the absolute order_amount (i.e. without the minus sign), then drop both rows of each duplicate pair.

df['unique'] = df.apply(
    lambda x: x['order_amount'].replace('-', '') + '|'
              + (x['number'] if x['number_of_refund'] == '' else x['number_of_refund']),
    axis=1)
# the same, vectorized:
# df['unique'] = df['order_amount'].str.replace('-','') + '|' + df['number_of_refund'].mask(df['number_of_refund'].eq(''), df['number'])
print(df)

  number order_amount number_of_refund       unique
0   1555       100.00                   100.00|1555
1   1666       200.00                   200.00|1666    #duplicate
2   1777      -200.00             1666  200.00|1666    #duplicate
3   1888       300.00                   300.00|1888
4   1999      -200.00             1888  200.00|1888

The duplicate rows are easily identified, and ready to be dropped (including the column unique)

df = df.drop_duplicates(['unique'], keep=False).drop(columns=['unique'])
print(df)

  number order_amount number_of_refund
0   1555       100.00                 
3   1888       300.00                 
4   1999      -200.00             1888

8 Comments

It's a bit dangerous however to rely on unique amounts. There could be the same amount for 2 different orders by coincidence.
but the order number will differentiate them, right? That is, both the order number and the order amount form the unique string
@perpetualstudent not directly because the mapping id is split in two different columns (see. here 1666 is not duplicated in any column). You could use number_of_refund filled with the number column on the blanks though. ;)
I took the liberty to update to fix the flaw, feel free to revert if you want ;)
sure, separator added! Is this better? Cheers!
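To illustrate the collision concern from this thread, a small sketch (with invented order numbers) showing that the amount alone would collide, while the amount-plus-number key keeps an unrelated order apart:

```python
import pandas as pd

# Two unrelated orders both worth 200.00, plus one genuine refund pair.
df = pd.DataFrame({
    "number": ["1666", "1777", "4242"],
    "order_amount": ["200.00", "-200.00", "200.00"],
    "number_of_refund": ["", "1666", ""],
})

# key = absolute amount + '|' + refund number (falling back to the order number)
key = (df["order_amount"].str.replace("-", "", regex=False)
       + "|"
       + df["number_of_refund"].mask(df["number_of_refund"].eq(""), df["number"]))
print(key.tolist())
# only the genuine refund pair shares a key; 4242 is not a duplicate
out = df[~key.duplicated(keep=False)]
print(out["number"].tolist())  # ['4242']
```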