I have the pandas DataFrame below. Here field1, field2, ... always vary, whereas col1, col2, ..., coln are mostly constant and change infrequently. Ultimately I save this in Parquet format; Parquet internally optimizes the duplicates, so that is not an issue.
I have another requirement to convert the Parquet file to CSV, and the CSV file size is shooting up. So I want to remove the duplicated values before saving to Parquet. Doing this column by column in code would increase the run time, since there can be 70-100 such columns.
date                      field1 field2 col1 col2 col3 col5
20200508062904.8340+0530  11     22     2    3    3    4
20200508062904.8340+0530  12     23     2    3    3    4
20200508062904.8340+0530  13     22     2    3    3    4
20200508062904.8340+0530  14     24     2    3    3    4
20200508051804.8340+0530  14     24     2    3    3    5
20200508051804.8340+0530  14     24     2    4    3    4
20200508051804.8340+0530  14     24     2    3    3    4
For the columns (col1, col2, col3, col5) I want to keep the initial value and blank out the repeating duplicates. If one of these columns later takes a value different from its initial value, the DataFrame should keep that value. "Initial" is relative: it means the most recent value seen so far.
Expected Output
date                      field1 field2 col1 col2 col3 col5
20200508062904.8340+0530  11     22     2    3    3    4
20200508062904.8340+0530  12     23
20200508062904.8340+0530  13     22
20200508062904.8340+0530  14     24
20200508051804.8340+0530  14     24                    5
20200508051804.8340+0530  14     24          4         4
20200508051804.8340+0530  14     24          3
drop_duplicates deletes whole rows, so it is not useful here. Is there a better way to achieve this in pandas?
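A minimal sketch of the row-wise masking I have in mind, assuming a blanked cell should become an empty string (the data below mirrors the sample above):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20200508062904.8340+0530"] * 4 + ["20200508051804.8340+0530"] * 3,
    "field1": [11, 12, 13, 14, 14, 14, 14],
    "field2": [22, 23, 22, 24, 24, 24, 24],
    "col1": [2, 2, 2, 2, 2, 2, 2],
    "col2": [3, 3, 3, 3, 3, 4, 3],
    "col3": [3, 3, 3, 3, 3, 3, 3],
    "col5": [4, 4, 4, 4, 5, 4, 4],
})

cols = ["col1", "col2", "col3", "col5"]

# Blank out a value wherever it equals the value in the previous row;
# the first row, and any value that changed, is kept.
# shift() compares each row with the one before it (the first row
# compares against NaN, so it is never blanked).
df[cols] = df[cols].mask(df[cols].eq(df[cols].shift()), "")
```

This is vectorized over all the columns at once, so it should not get slower per column even with 70-100 of them. Note the masked columns become object dtype once empty strings are mixed in, which is fine if the frame is only written out to CSV afterwards.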