
I have the pandas DataFrame below. Here field1, field2, ... are always variable, whereas col1, col2, ..., coln are mostly constant and change infrequently. Ultimately I save this in Parquet format. Parquet internally optimizes away the duplicates, so that is not an issue.

I have another requirement to convert the Parquet file to CSV, and the CSV file size shoots up. So I want to remove the duplicates before saving to Parquet. Doing this column by column in code would increase the runtime, as there could be 70-100 such columns.

date                          field1 field2 col1 col2 col3 col5
20200508062904.8340+0530       11       22      2     3    3   4
20200508062904.8340+0530       12       23      2     3    3   4
20200508062904.8340+0530       13       22      2     3    3   4
20200508062904.8340+0530       14       24      2     3    3   4
20200508051804.8340+0530       14       24      2     3    3   5
20200508051804.8340+0530       14       24      2     4    3   4
20200508051804.8340+0530       14       24      2     3    3   4
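For anyone who wants to reproduce this, the frame above can be reconstructed as follows (values copied verbatim from the table):

```python
import pandas as pd

# Sample frame from the question: field1/field2 vary, col1..col5 mostly repeat
df = pd.DataFrame({
    'date':   ['20200508062904.8340+0530'] * 4 + ['20200508051804.8340+0530'] * 3,
    'field1': [11, 12, 13, 14, 14, 14, 14],
    'field2': [22, 23, 22, 24, 24, 24, 24],
    'col1':   [2, 2, 2, 2, 2, 2, 2],
    'col2':   [3, 3, 3, 3, 3, 4, 3],
    'col3':   [3, 3, 3, 3, 3, 3, 3],
    'col5':   [4, 4, 4, 4, 5, 4, 4],
})
```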

For the columns (col1, col2, col3, col5) I want to keep the initial value and remove consecutively repeating values. If one of these columns later takes a value different from the initial value, the DataFrame should keep it. "Initial" is relative here: it means the most recent previous value.

Expected Output

 date                          field1 field2 col1 col2 col3 col5
20200508062904.8340+0530       11       22      2   3    3   4
20200508062904.8340+0530       12       23      
20200508062904.8340+0530       13       22      
20200508062904.8340+0530       14       24      
20200508051804.8340+0530       14       24                    5
20200508051804.8340+0530       14       24               4    4
20200508051804.8340+0530       14       24               3        

drop_duplicates deletes whole rows, so it is not useful in this case. Is there a better way to achieve this in pandas?


3 Answers


Create a mask checking whether each column is equal to itself shifted, then fill the masked (repeating) values with empty strings:

cols = [x for x in df.columns if x.startswith('col')]

# @AndyL. points out this equivalent mask is far simpler
m = df[cols].ne(df[cols].shift())

df[cols] = df[cols].astype('O').where(m).fillna('')

                       date  field1  field2 col1 col2 col3 col5
0  20200508062904.8340+0530      11      22    2    3    3    4
1  20200508062904.8340+0530      12      23                    
2  20200508062904.8340+0530      13      22                    
3  20200508062904.8340+0530      14      24                    
4  20200508051804.8340+0530      14      24                   5
5  20200508051804.8340+0530      14      24         4         4
6  20200508051804.8340+0530      14      24         3          

Previously used the unnecessarily complicated mask:

m = ~df[cols].ne(df[cols].shift()).cumsum().apply(pd.Series.duplicated)
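One thing worth noting about this representation: the blanking is reversible. Assuming the blanked cells come back as NaN when the CSV is re-read (pandas' default for empty fields), a forward fill restores the original repeating values. A minimal sketch:

```python
import pandas as pd

# Hypothetical round trip: a frame with blanked repeats, as it would look
# after re-reading the sparse CSV (empty cells parsed as NaN)
sparse = pd.DataFrame({
    'col1': [2, None, None],
    'col2': [3, None, 4],
})

# Forward fill propagates the last seen value down the column
restored = sparse.ffill()
```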

2 Comments

Nice answer :) +1. Just a question, is there a reason for using additional cumsum and check duplicated instead of just df[cols].ne(df[cols].shift()) ?
@AndyL. Yeah that's way simpler. Seems I got caught up in wanting to use .drop_duplicates

You could use .where and .shift to blank out consecutive duplicate values, doing this for each column. If you have many columns, you can do the below in a loop, as @ALollz has done in his answer.

df['col1'] = df['col1'].where(df['col1'] != df['col1'].shift(), '')

Full code with a loop:

for col in df.columns:
    if 'col' in col:
        df[col] = df[col].where(df[col] != df[col].shift(), '')

output:

    date                        field1  field2  col1    col2    col3    col5
0   20200508062904.8340+0530    11      22      2       3       3       4
1   20200508062904.8340+0530    12      23              
2   20200508062904.8340+0530    13      22              
3   20200508062904.8340+0530    14      24              
4   20200508051804.8340+0530    14      24                              5
5   20200508051804.8340+0530    14      24              4               4
6   20200508051804.8340+0530    14      24              3       



You may try diff and where with a callable, then replace and fillna, and update the result back into the original df:

cols = ['col1', 'col2', 'col3', 'col5']

df.update(df[cols].diff().eq(0).where(lambda x: x)
                               .replace(1,'').fillna(df[cols]))

Out[315]:
                       date  field1  field2 col1 col2 col3 col5
0  20200508062904.8340+0530      11      22    2    3    3    4
1  20200508062904.8340+0530      12      23
2  20200508062904.8340+0530      13      22
3  20200508062904.8340+0530      14      24
4  20200508051804.8340+0530      14      24                   5
5  20200508051804.8340+0530      14      24         4         4
6  20200508051804.8340+0530      14      24         3
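To see what the chained expression does, here it is broken into steps on a single numeric column (a sketch; note that diff only works on numeric columns, unlike the shift-based masks above):

```python
import pandas as pd

s = pd.DataFrame({'col5': [4, 4, 5, 4]})

step1 = s.diff().eq(0)            # True where a value repeats the previous row
step2 = step1.where(lambda x: x)  # keep True, everything else becomes NaN
step3 = step2.replace(1, '').fillna(s)  # True -> '', NaN -> original value
```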

