I am creating a script that reads an xlsx file into a pandas DataFrame and appends new rows to it. However, I don't want to add duplicates that have the same values in the first four columns (there are five columns overall). The fifth column can hold any value, but when a row duplicates an existing row on these four columns, I would like to drop the whole row.
My code is fully functional apart from this. I could do this by looping over the DataFrame, but I believe there is a smarter way.
Example data is below. How can I delete the last row (index 5), which has the same values in the first four columns as row 4 but a different value in the fifth column?
   Category  Year  Week  Price  Amount
0         1  2019    27      2       1
1         1  2019    28      3       2
2         1  2019    29      4       3
3         2  2019    29      4       4
4         3  2019    30      5       3
5         3  2019    30      5       4
Part of the code:
# Append new rows to dataframe
combined_df = file_df.append(new_rows, sort=False, ignore_index=True)
# Delete duplicate rows
combined_df = combined_df.drop_duplicates()
This code currently removes only rows whose values match in every column. I could not find a clean way to remove rows that match on just the first four columns. Please correct me if the question is not relevant.
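For illustration, one way to express the behavior described above is `DataFrame.drop_duplicates` with its `subset` parameter, which restricts the comparison to the listed columns. The sketch below rebuilds the example table by hand and assumes that the first row of each duplicate group is the one to keep; the names `key_columns` and `deduplicated_df` are only illustrative and not part of the original script.

```python
import pandas as pd

# Example data copied from the table in the question.
combined_df = pd.DataFrame({
    "Category": [1, 1, 1, 2, 3, 3],
    "Year": [2019, 2019, 2019, 2019, 2019, 2019],
    "Week": [27, 28, 29, 29, 30, 30],
    "Price": [2, 3, 4, 4, 5, 5],
    "Amount": [1, 2, 3, 4, 3, 4],
})

# Columns that define a duplicate (the first four); Amount is ignored.
key_columns = ["Category", "Year", "Week", "Price"]

# keep="first" (assumed here) retains the earliest row of each duplicate group.
deduplicated_df = combined_df.drop_duplicates(subset=key_columns, keep="first")

print(deduplicated_df)
```

With this data, only the row at index 5 is dropped, because it matches row 4 on all four key columns. If the newly appended row should win instead, `keep="last"` retains the latest row of each group, and `keep=False` drops every row that has a duplicate.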