
I am creating a script that reads an xlsx file into a pandas DataFrame and appends new rows to it. However, my problem is that I don't want to add duplicates that have the same values in the first four columns (there are 5 columns overall). The fifth column value can be anything, but if a row duplicates another row in these four columns I would like to delete the whole row.

My code is fully functional apart from this. I could do this by looping over the DataFrame, but I believe there is a smarter way to do it.

Example of the data is below. How can I delete the last row, when it has the same first four columns as row 4 but a different 5th column?

       Category  Year  Week  Price  Amount
    0         1  2019    27      2       1
    1         1  2019    28      3       2
    2         1  2019    29      4       3
    3         2  2019    29      4       4
    4         3  2019    30      5       3
    5         3  2019    30      5       4

Part of the code:

# Append new rows to dataframe
file_df = file_df.append(new_rows, sort=False, ignore_index=True)

# Delete duplicate rows
combined_df = combined_df.drop_duplicates()

This code currently removes only rows where all column values are identical. I could not find a smart solution for removing duplicates based on a subset of columns. Please correct me if the question is not relevant.

1 Answer


Try DataFrame.drop_duplicates and set the subset parameter to the columns you want to compare on:

df.drop_duplicates(subset=['Category', 'Year', 'Week', 'Price'], inplace=True)
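
For reference, here is a minimal self-contained sketch using the example data from the question. With keep='first' (the default), row 4 is kept and row 5 is dropped because it repeats the same Category, Year, Week and Price:

import pandas as pd

# Rebuild the example data from the question
df = pd.DataFrame({
    'Category': [1, 1, 1, 2, 3, 3],
    'Year':     [2019, 2019, 2019, 2019, 2019, 2019],
    'Week':     [27, 28, 29, 29, 30, 30],
    'Price':    [2, 3, 4, 4, 5, 5],
    'Amount':   [1, 2, 3, 4, 3, 4],
})

# Keep only the first occurrence of each (Category, Year, Week, Price) combination;
# the Amount column is ignored when deciding what counts as a duplicate
df = df.drop_duplicates(subset=['Category', 'Year', 'Week', 'Price'], keep='first')
print(df)

If you would rather keep the newly appended row instead of the original one, pass keep='last'.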
