How can I remove duplicates column by column in a pandas DataFrame, so that:

set1    set2    set3    set4
apple   apple   orange  orange
apple   orange  banana  orange
orange  banana  pear    
banana  banana  lemon   
pear            lemon   
grape           lemon

becomes:

set1    set2    set3    set4
apple   apple   orange  orange
orange  orange  banana  
banana  banana  pear    
pear            lemon   
grape   
    If you want the unique values from a column, you can always do df['column_name'].unique(). Commented Aug 30, 2019 at 13:00

4 Answers

Use:

m = df.apply(lambda x: dict.fromkeys(x).keys())
pd.DataFrame(m.values.tolist(),index=m.index).T

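Or a more concise way, courtesy of @piRSquared (a set does not guarantee the original order, so the rows may come out in a different order than shown below):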

pd.DataFrame.from_dict({k: {*df[k].dropna()} for k in df}, orient='index').T

     set1    set2    set3    set4
0   apple   apple  orange  orange
1  orange  orange  banana     NaN
2  banana  banana    pear    None
3    pear     NaN   lemon    None
4   grape    None    None    None
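
If the original order of first appearance matters, as in the expected output of the question, one option is to drop missing values before deduplicating with dict.fromkeys. This is only a sketch: the frame below is rebuilt from the question's table, and it assumes the blank cells are NaN rather than empty strings.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (blank cells assumed to be NaN).
df = pd.DataFrame({
    "set1": ["apple", "apple", "orange", "banana", "pear", "grape"],
    "set2": ["apple", "orange", "banana", "banana", np.nan, np.nan],
    "set3": ["orange", "banana", "pear", "lemon", "lemon", "lemon"],
    "set4": ["orange", "orange", np.nan, np.nan, np.nan, np.nan],
})

# dict.fromkeys keeps the first occurrence of each value, in order.
# Wrapping each column in a Series lets pandas pad the shorter columns with NaN.
result = pd.DataFrame(
    {c: pd.Series(list(dict.fromkeys(df[c].dropna()))) for c in df}
)
print(result)
```

Because every column is deduplicated independently, the rows of the result no longer line up with the rows of the input.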

Here is another way, using melt and pivot:

(df.melt()
   .dropna()
   .drop_duplicates(['variable', 'value'])
   .assign(key=lambda x: x.groupby('variable').cumcount())
   .pivot(index='key', columns='variable', values='value'))
Out[806]: 
variable    set1    set2    set3    set4
key                                     
0          apple   apple  orange  orange
1         orange  orange  banana     NaN
2         banana  banana    pear     NaN
3           pear     NaN   lemon     NaN
4          grape     NaN     NaN     NaN
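
The chain above may be easier to follow when split into named steps. The following is a sketch of the same idea; the sample frame is rebuilt from the question with blank cells assumed to be NaN, and the intermediate names (long_form, wide) are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (blank cells assumed to be NaN).
df = pd.DataFrame({
    "set1": ["apple", "apple", "orange", "banana", "pear", "grape"],
    "set2": ["apple", "orange", "banana", "banana", np.nan, np.nan],
    "set3": ["orange", "banana", "pear", "lemon", "lemon", "lemon"],
    "set4": ["orange", "orange", np.nan, np.nan, np.nan, np.nan],
})

# 1. Long format: one row per (column name, value) pair, missing values removed.
long_form = df.melt().dropna()

# 2. Keep only the first occurrence of each value within each original column.
long_form = long_form.drop_duplicates(["variable", "value"])

# 3. Number the surviving values 0, 1, 2, ... within each original column.
long_form = long_form.assign(key=long_form.groupby("variable").cumcount())

# 4. Back to wide format: the counter becomes the row index.
wide = long_form.pivot(index="key", columns="variable", values="value")
print(wide)
```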

itertools.zip_longest

from itertools import zip_longest

pd.DataFrame(
    [*zip_longest(*({*df[c].dropna()} for c in df))],
    columns=[*df]
)

     set1    set2    set3    set4
0  banana  orange  banana  orange
1   grape  banana   lemon    None
2    pear   apple    pear    None
3   apple    None  orange    None
4  orange    None    None    None
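
The set-based version returns the right values but in arbitrary order, as the output above shows. If order matters, a possible variant swaps the set for pd.unique, which returns values in order of first appearance. This is a sketch on a frame rebuilt from the question, with blank cells assumed to be NaN.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (blank cells assumed to be NaN).
df = pd.DataFrame({
    "set1": ["apple", "apple", "orange", "banana", "pear", "grape"],
    "set2": ["apple", "orange", "banana", "banana", np.nan, np.nan],
    "set3": ["orange", "banana", "pear", "lemon", "lemon", "lemon"],
    "set4": ["orange", "orange", np.nan, np.nan, np.nan, np.nan],
})

# pd.unique preserves order of first appearance, unlike a set.
# zip_longest turns the per-column arrays into rows, padding with None.
rows = zip_longest(*(pd.unique(df[c].dropna()) for c in df))
result = pd.DataFrame(list(rows), columns=list(df.columns))
print(result)
```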

collections.defaultdict and itertools.count

# %%timeit
from collections import defaultdict
from itertools import count
i = defaultdict(count)

pd.DataFrame({c: {next(i[c]): v for v in {*df[c].dropna()}} for c in df})

     set1    set2    set3    set4
0    pear   apple  orange  orange
1   grape  banana   lemon     NaN
2   apple  orange  banana     NaN
3  banana     NaN    pear     NaN
4  orange     NaN     NaN     NaN

You can also use drop_duplicates:

df.apply(lambda x: x.drop_duplicates().reset_index(drop=True))

     set1    set2    set3    set4
0   apple   apple  orange  orange
1  orange  orange  banana     NaN
2  banana  banana    pear     NaN
3    pear     NaN   lemon     NaN
4   grape     NaN     NaN     NaN
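
One caveat: drop_duplicates treats NaN as a value and keeps its first occurrence, so a gap in the middle of a column survives. Calling dropna first avoids that. The sketch below uses a small hypothetical frame (columns a and b are not from the question) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in the middle of each column.
df = pd.DataFrame({
    "a": ["x", np.nan, "x", "y"],
    "b": ["p", "q", np.nan, "p"],
})

# Without dropna, the first NaN in each column is kept as a "unique" value.
naive = df.apply(lambda s: s.drop_duplicates().reset_index(drop=True))

# With dropna first, only real values remain, packed at the top of each column.
result = df.apply(lambda s: s.dropna().drop_duplicates().reset_index(drop=True))

print(naive)
print(result)
```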
