How to remove duplicate values from the python dataframe?

Question

I have a dataframe with duplicate values in either list or string format.

df = Name                    Email                                                    years      score
     john                    [[email protected],[email protected], [email protected]]   8          good
     [devan,smith ,devan]    [[email protected]]                                       [8,6,8]    good

I want to remove duplicate values within each individual cell, not compare values across different cells.

df_updated = Name             Email                                      years     score
             john             [[email protected],[email protected]]       8         good
             [devan,smith]    [[email protected]]                         [8,6]     good

Your input data is ambiguous. Please provide it as a dataframe or dictionary: use df.to_dict('list') and update your question.
– mozway, Commented Feb 18, 2022 at 9:48

jezrael · Accepted Answer · 2022-02-18 10:14:34Z

1

Use DataFrame.applymap for element-wise processing, with a custom function that removes duplicates when the value is a list:

import pandas as pd

df = pd.DataFrame({'Name':['John', ['aa','devan','smith','devan']],
                   'years':[8, [8,6,8]]})

print (df)
                        Name      years
0                       John          8
1  [aa, devan, smith, devan]  [8, 6, 8]

df1 = df.applymap(lambda x: list(dict.fromkeys(x)) if isinstance(x, list) else x)
print (df1)
                 Name   years
0                John       8
1  [aa, devan, smith]  [8, 6]
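
On pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map. A minimal sketch of the same idea that should run on either version; the names dedupe_cell and elementwise are made up for illustration and are not part of the original answer:

```python
import pandas as pd

def dedupe_cell(value):
    """Drop repeated items from a list cell, keeping first-seen order."""
    if isinstance(value, list):
        return list(dict.fromkeys(value))
    return value  # scalars (str, int, ...) pass through unchanged

df = pd.DataFrame({'Name': ['John', ['aa', 'devan', 'smith', 'devan']],
                   'years': [8, [8, 6, 8]]})

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap
df1 = elementwise(dedupe_cell)
print(df1)
```

Using a named function instead of a lambda also makes it easier to extend the rule later, for example to tuples.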

If ordering is not important, use sets (the order of the items in each resulting list is then arbitrary):

df2 = df.applymap(lambda x: list(set(x)) if isinstance(x, list) else x)
print (df2)
                 Name   years
0                John       8
1  [devan, smith, aa]  [8, 6]
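
Both variants above only touch cells that are real Python lists. The question also mentions cells in string format; here is a rough sketch for that case, assuming such a cell is a comma-separated string that may be wrapped in square brackets. The helper dedupe_any and that string layout are assumptions, since the question does not show the exact data:

```python
def dedupe_any(value, sep=','):
    """Remove duplicates from a list cell or a separator-delimited string cell."""
    if isinstance(value, list):
        return list(dict.fromkeys(value))
    if isinstance(value, str) and sep in value:
        # Assumed layout: "a,b ,a" or "[a,b ,a]"; whitespace around items is ignored.
        parts = [part.strip() for part in value.strip().strip('[]').split(sep)]
        return list(dict.fromkeys(part for part in parts if part))
    return value  # plain strings and numbers are left as they are

as_list = dedupe_any(['devan', 'smith', 'devan'])
as_text = dedupe_any('[devan,smith ,devan]')
untouched = dedupe_any('john')
print(as_list, as_text, untouched)
```

The function can be passed to applymap in place of the lambda shown earlier. Note that string cells come back as lists here; join them with sep if a string is needed.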

edited Feb 18, 2022 at 10:14

answered Feb 18, 2022 at 9:50

jezrael


4 Comments

mozway Over a year ago

looks like dupe ;) I thought of the exact same solution and I imagined it already existed

jezrael Over a year ago

@mozway - hmm, partly.

mozway Over a year ago

I would say fully: "After some discussion solution is: df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))", except that it uses map in place of applymap and set instead of dict.fromkeys, but you also showed those in the dupe

jezrael Over a year ago

@mozway - ya, partly dupe. part solution is same, part not.

TheFaultInOurStars · Accepted Answer · 2022-02-18 09:49:38Z

0

Without the actual dataframe, it is hard to guess how your data is structured. Anyway, here is what you probably need:

df["Email"].apply(set)

Note that every value in the Email column must be a list, and that the result contains sets rather than lists. To remove duplicates from another column, say Name, replace Email with Name in the line above.
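
As a variation on the single-column approach, the sketch below keeps the values as lists, preserves item order, and skips cells that are not lists. The email addresses are placeholders (the real ones are obscured in the question), and unique_in_order is a hypothetical helper name:

```python
import pandas as pd

# Placeholder data modeled on the question; the addresses are invented.
df = pd.DataFrame({
    'Name': ['john', ['devan', 'smith', 'devan']],
    'Email': [['a@example.com', 'b@example.com', 'a@example.com'],
              ['c@example.com']],
})

def unique_in_order(value):
    # Only lists are deduplicated; any other value is returned untouched.
    return list(dict.fromkeys(value)) if isinstance(value, list) else value

# apply returns a new Series, so assign it back to keep the result.
df['Email'] = df['Email'].apply(unique_in_order)
df['Name'] = df['Name'].apply(unique_in_order)
print(df)
```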

answered Feb 18, 2022 at 9:49

TheFaultInOurStars
