Using pandas, identify similar values between columns [duplicate]

Question

I have two columns:

A         B         
2001     2003
2003     1999
1990     2001
1995     2010
2004     1996

I want to check whether there are similar values between the two columns, regardless of which rows they are in, and place them in a new column (SIMILAR).

This is the output that I would like to have

A        B        SIMILAR
2001     2003     2003
2003     1999     2001
1990     2001
1995     2010
2004     1996

Thank you

Please share what has been tried so far.

– VN'sCorner, Commented Apr 29, 2020 at 18:59
Define "similar".

– Scott Boston, Commented Apr 29, 2020 at 19:02

It_is_Chris · Accepted Answer · 2020-04-29 19:01:52Z

1

IIUC you can use isin:

df.loc[df['A'].isin(df['B']), 'A'].values
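
For a self-contained check, here is a minimal sketch that rebuilds the sample data from the question and applies isin. The variable names mask and common are only illustrative and do not come from the answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "A": [2001, 2003, 1990, 1995, 2004],
    "B": [2003, 1999, 2001, 2010, 1996],
})

# Boolean mask: True where the value in A also occurs anywhere in B
mask = df["A"].isin(df["B"])

# Values of A that are also present in B, as a NumPy array
common = df.loc[mask, "A"].values
print(common)  # [2001 2003]
```

The mask is computed against all of B at once, so the row a value sits in does not matter.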


timgeb · Accepted Answer · 2020-04-29 19:03:52Z

1

If by "similar" you mean equal, I'd solve this with the isin method. I'm also assuming that the order of values in the new column does not matter.

>>> df['SIMILAR'] = df.loc[df['A'].isin(df['B']), 'A']
>>> df
      A     B  SIMILAR
0  2001  2003   2001.0
1  2003  1999   2003.0
2  1990  2001      NaN
3  1995  2010      NaN
4  2004  1996      NaN
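
The SIMILAR column above is float because NaN forces a floating dtype, and each match stays on the row of A where it was found. If the layout from the question is wanted instead (matches in the order they occur in B, packed at the top, kept as integers), one possible sketch follows. It assumes a pandas version that provides the nullable Int64 dtype, and the name matches is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [2001, 2003, 1990, 1995, 2004],
    "B": [2003, 1999, 2001, 2010, 1996],
})

# Values of B that also occur in A, renumbered 0..n-1 so they line up with the top rows
matches = df.loc[df["B"].isin(df["A"]), "B"].reset_index(drop=True)

# Column assignment aligns on the index; rows without a match become missing.
# The nullable "Int64" dtype keeps the matches as integers instead of floats.
df["SIMILAR"] = matches.astype("Int64")
print(df)
```

Like the answer itself, this relies on df having the default 0..n-1 index.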


Vitor Albres · Accepted Answer · 2020-04-29 19:07:03Z

0

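To find the duplicated values you can do something like this. Note that calling duplicated() on the whole DataFrame only flags repeated rows, so stack the two columns into a single Series first (this assumes no value is repeated within one column):

import pandas as pd

# Stack A and B into one Series, then keep the values that occur a second time
stacked = pd.concat([pdData["A"], pdData["B"]], ignore_index=True)
duplicateValues = stacked[stacked.duplicated()]
print("Values that appear in both columns:")
print(duplicateValues)

The response should be something like this:

SIMILAR
2003
2001

Then you just use this new data to create a new column, resetting the index so the values land in the first rows:

pdData["Similar"] = duplicateValues.reset_index(drop=True)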


rafaelc · Accepted Answer · 2020-04-29 19:09:55Z

0

Code-golfing with set intersection (assumes a standard range index):

df['C'] = pd.Series([*{*df.A} & {*df.B}])

      A     B       C
0  2001  2003  2001.0
1  2003  1999  2003.0
2  1990  2001     NaN
3  1995  2010     NaN
4  2004  1996     NaN
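
Unpacked, the one-liner builds a set from each column, intersects them, and wraps the result in a Series whose default 0..n-1 index lines up with the first rows of df. Set iteration order is not guaranteed, so the sketch below sorts the shared values to make the output reproducible. The sorting step and the name shared are additions, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [2001, 2003, 1990, 1995, 2004],
    "B": [2003, 1999, 2001, 2010, 1996],
})

# Values present in both columns; sorted so the order is reproducible
shared = sorted(set(df["A"]) & set(df["B"]))

# The new Series gets index 0..n-1, so it aligns with the first n rows of df
# (this is why a standard range index is assumed); the remaining rows become NaN
df["C"] = pd.Series(shared)
print(df)
```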
