
I have the pandas DataFrame below. Here field1, field2, ... are always variable, whereas col1, col2, ..., coln are mostly constant and change infrequently. Ultimately I save this in Parquet format. Parquet internally optimizes away the duplicates, so that is not an issue.

I have another requirement to convert the Parquet file to CSV, and the CSV file size shoots up. So I want to remove the duplicates before saving to Parquet. Doing this column by column in code would increase the runtime, as there could be 70-100 such columns.

date                          field1 field2 col1 col2 col3 col5
20200508062904.8340+0530       11       22      2     3    3   4
20200508062904.8340+0530       12       23      2     3    3   4
20200508062904.8340+0530       13       22      2     3    3   4
20200508062904.8340+0530       14       24      2     3    3   4
20200508051804.8340+0530       14       24      2     3    3   5
20200508051804.8340+0530       14       24      2     4    3   4
20200508051804.8340+0530       14       24      2     3    3   4
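For anyone who wants to reproduce this, the frame above can be reconstructed as follows (values copied verbatim from the table):

```python
import pandas as pd

# Sample frame from the question: field1/field2 vary, col1..col5 mostly repeat
df = pd.DataFrame({
    'date':   ['20200508062904.8340+0530'] * 4 + ['20200508051804.8340+0530'] * 3,
    'field1': [11, 12, 13, 14, 14, 14, 14],
    'field2': [22, 23, 22, 24, 24, 24, 24],
    'col1':   [2, 2, 2, 2, 2, 2, 2],
    'col2':   [3, 3, 3, 3, 3, 4, 3],
    'col3':   [3, 3, 3, 3, 3, 3, 3],
    'col5':   [4, 4, 4, 4, 5, 4, 4],
})
```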

For the columns (col1, col2, col3, col5) I want to keep the initial value and remove consecutively repeating values. If one of these columns later takes a value different from the initial value, the DataFrame should keep it. "Initial" is relative here: it means the most recent previous value.

Expected Output

 date                          field1 field2 col1 col2 col3 col5
20200508062904.8340+0530       11       22      2   3    3   4
20200508062904.8340+0530       12       23      
20200508062904.8340+0530       13       22      
20200508062904.8340+0530       14       24      
20200508051804.8340+0530       14       24                    5
20200508051804.8340+0530       14       24               4    4
20200508051804.8340+0530       14       24               3        

drop_duplicates deletes whole rows, so it is not useful in this case. Is there a better way to achieve this in pandas?


3 Answers


Create a mask checking whether each column is equal to itself shifted, then fill the masked (repeating) values with empty strings:

cols = [x for x in df.columns if x.startswith('col')]

# @AndyL. points out this equivalent mask is far simpler
m = df[cols].ne(df[cols].shift())

df[cols] = df[cols].astype('O').where(m).fillna('')

                       date  field1  field2 col1 col2 col3 col5
0  20200508062904.8340+0530      11      22    2    3    3    4
1  20200508062904.8340+0530      12      23                    
2  20200508062904.8340+0530      13      22                    
3  20200508062904.8340+0530      14      24                    
4  20200508051804.8340+0530      14      24                   5
5  20200508051804.8340+0530      14      24         4         4
6  20200508051804.8340+0530      14      24         3          

Previously used the unnecessarily complicated mask:

m = ~df[cols].ne(df[cols].shift()).cumsum().apply(pd.Series.duplicated)
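One thing worth noting about this representation: the blanking is reversible. Assuming the blanked cells come back as NaN when the CSV is re-read (pandas' default for empty fields), a forward fill restores the original repeating values. A minimal sketch:

```python
import pandas as pd

# Hypothetical round trip: a frame with blanked repeats, as it would look
# after re-reading the sparse CSV (empty cells parsed as NaN)
sparse = pd.DataFrame({
    'col1': [2, None, None],
    'col2': [3, None, 4],
})

# Forward fill propagates the last seen value down the column
restored = sparse.ffill()
```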

2 Comments

Nice answer :) +1. Just a question, is there a reason for using additional cumsum and check duplicated instead of just df[cols].ne(df[cols].shift()) ?
@AndyL. Yeah that's way simpler. Seems I got caught up in wanting to use .drop_duplicates

You could use .where and .shift to blank out consecutive duplicate values, doing this for each column. If you have many columns, you can do the below in a loop, as @ALollz has done in his answer.

df['col1'] = df['col1'].where(df['col1'] != df['col1'].shift(), '')

Full code with a loop:

for col in df.columns:
    if 'col' in col:
        df[col] = df[col].where(df[col] != df[col].shift(), '')

output:

    date                        field1  field2  col1    col2    col3    col5
0   20200508062904.8340+0530    11      22      2       3       3       4
1   20200508062904.8340+0530    12      23              
2   20200508062904.8340+0530    13      22              
3   20200508062904.8340+0530    14      24              
4   20200508051804.8340+0530    14      24                              5
5   20200508051804.8340+0530    14      24              4               4
6   20200508051804.8340+0530    14      24              3       



You may try diff and where with a callable, then replace and fillna, and update the result back into the original df:

cols = ['col1', 'col2', 'col3', 'col5']

df.update(df[cols].diff().eq(0).where(lambda x: x)
                               .replace(1,'').fillna(df[cols]))

Out[315]:
                       date  field1  field2 col1 col2 col3 col5
0  20200508062904.8340+0530      11      22    2    3    3    4
1  20200508062904.8340+0530      12      23
2  20200508062904.8340+0530      13      22
3  20200508062904.8340+0530      14      24
4  20200508051804.8340+0530      14      24                   5
5  20200508051804.8340+0530      14      24         4         4
6  20200508051804.8340+0530      14      24         3
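To see what the chained expression does, here it is broken into steps on a single numeric column (a sketch; note that diff only works on numeric columns, unlike the shift-based masks above):

```python
import pandas as pd

s = pd.DataFrame({'col5': [4, 4, 5, 4]})

step1 = s.diff().eq(0)            # True where a value repeats the previous row
step2 = step1.where(lambda x: x)  # keep True, everything else becomes NaN
step3 = step2.replace(1, '').fillna(s)  # True -> '', NaN -> original value
```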

