How to get a true/false without duplicates when comparing two Pandas dataframes?

Question

I have one dataframe with sessions - one session, one row, so SID is unique. The session has a doctor name.

SID	Doctor	Patient
1	robby	david
2	langdon	sara
3	langdon	michael

I have another dataframe with the SID and a record of who opened the patient file. The opener can be either the doctor or anyone else from the clinic. If two different people from the clinic open the patient file in the same session, there will be two rows with the same SID that differ only in opener_name.

SID	opener_name
1	robby
1	dana
2	dana

I want to generate a true/false column in the sessions dataframe for:

If the doctor opened the file
If anyone opened the file at all (either the doctor or anyone else)

Sessions were not necessarily opened by anyone; a session that nobody opened won't appear in the second dataframe at all.

The output I desire is this:

SID	Doctor	Patient	is_doctor_opened	is_anyone_opened
1	robby	david	True	True
2	langdon	sara	False	True
3	langdon	michael	False	False

If I merge the two dataframes on session ID, I will get duplicate rows, and I'm not sure how to get rid of the duplicates in that scenario.

I've also tried playing around with simple booleans but I run into problems.

How do I get an organized dataframe with the booleans and keep it to one session, one row?
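For reference, the duplication described above can be reproduced with a minimal sketch. The variable names `sessions`, `openers`, and `naive` are assumptions, since the question does not show any code.

```python
import pandas as pd

sessions = pd.DataFrame({
    'SID': [1, 2, 3],
    'Doctor': ['robby', 'langdon', 'langdon'],
    'Patient': ['david', 'sara', 'michael'],
})
openers = pd.DataFrame({
    'SID': [1, 1, 2],
    'opener_name': ['robby', 'dana', 'dana'],
})

# A plain left merge repeats a session once per matching opener row.
naive = sessions.merge(openers, on='SID', how='left')

# SID 1 has two opener records, so it occupies two rows after the merge.
rows_per_sid = naive.groupby('SID').size()

print(naive)
print(rows_per_sid)
```

Any solution therefore has to collapse the opener records to one value per SID, either before or after joining.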

Would be better if you add sample source tables and a desired resulting table.
– strawdog, Commented Nov 16 at 12:22
I'm voting to reopen, but more details would help: 1) Show your code please, even if it's just the merge. For one thing, showing the variable names helps keep answers consistent. 2) What do you mean by "playing around with simple booleans"? Please edit to clarify and show code if possible. 3) What research have you done? E.g. are you aware of .drop_duplicates()?
– wjandrea, Commented Nov 18 at 13:54

wjandrea · Accepted Answer · 2025-11-19 12:23:52Z

This is a classic join + groupby task in pandas. Here are two approaches; both produce the exact output you showed, so pick whichever reads better to you.

I'll use your example data and show code + result.

import pandas as pd

sessions = pd.DataFrame({
    'SID': [1, 2, 3],
    'Doctor': ['robby', 'langdon', 'langdon'],
    'Patient': ['david', 'sara', 'michael']})

openers = pd.DataFrame({
    'SID': [1, 1, 2],
    'opener_name': ['robby', 'dana', 'dana']})

Method A — simple & fast (using sets)

# Sessions that have at least one opener record
opened_sids = set(openers['SID'].unique())
sessions['is_anyone_opened'] = sessions['SID'].isin(opened_sids)

# Attach each session's doctor to the opener rows
merged = openers.merge(sessions[['SID', 'Doctor']], on='SID', how='left')
# SIDs where at least one opener is the session's doctor
doctor_open_rows = merged[
    merged['opener_name'] == merged['Doctor']
]['SID'].unique()
sessions['is_doctor_opened'] = sessions['SID'].isin(doctor_open_rows)

# .isin() already returns a boolean Series, so no dtype conversion is needed

print(sessions)
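As an alternative to the merge in Method A, the doctor check can compare (SID, name) pairs directly. This is a sketch, not part of the original answer: it assumes the same sample frames, and the names `opened_pairs`, `doctor_pairs`, and `flags` are made up for illustration.

```python
import pandas as pd

sessions = pd.DataFrame({
    'SID': [1, 2, 3],
    'Doctor': ['robby', 'langdon', 'langdon'],
    'Patient': ['david', 'sara', 'michael'],
})
openers = pd.DataFrame({
    'SID': [1, 1, 2],
    'opener_name': ['robby', 'dana', 'dana'],
})

# (SID, name) pairs from each frame, held as MultiIndex objects
opened_pairs = pd.MultiIndex.from_frame(openers[['SID', 'opener_name']])
doctor_pairs = pd.MultiIndex.from_frame(sessions[['SID', 'Doctor']])

# assign() returns a new frame, so `sessions` itself is left unchanged
flags = sessions.assign(
    # True where the session's (SID, Doctor) pair occurs among the opener pairs
    is_doctor_opened=doctor_pairs.isin(opened_pairs),
    # True where the SID has any opener record at all
    is_anyone_opened=sessions['SID'].isin(openers['SID']),
)

print(flags)
```

Because both checks are membership tests, repeated opener rows cannot add rows to the result.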

Method B — explicit merge + groupby (robust, great for large data)

This uses groupby(...).any() so duplicates don’t make extra rows.

# 1) is_anyone_opened: any record for SID?
anyone = openers[['SID']].drop_duplicates().assign(is_anyone_opened=True)

# 2) is_doctor_opened: merge opener rows with sessions to compare 
# names, then groupby SID
merged = openers.merge(sessions[['SID', 'Doctor']], on='SID', how='left')
merged['is_doctor_opened'] = merged['opener_name'] == merged['Doctor']
doctor_flag = merged.groupby('SID', as_index=False)['is_doctor_opened'].any()

# 3) left-join these flags back to sessions; missing -> False
result = (
    sessions
    .merge(anyone, on='SID', how='left')
    .merge(doctor_flag, on='SID', how='left')
    .fillna({'is_anyone_opened': False, 'is_doctor_opened': False})
)

# convert to bool
result['is_anyone_opened'] = result['is_anyone_opened'].astype(bool)
result['is_doctor_opened'] = result['is_doctor_opened'].astype(bool)

print(result)
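A shorter variant of Method B is sketched below. It keeps the groupby(...).any() step but aligns the flags with reindex(..., fill_value=False) instead of a left merge followed by fillna, which I believe may emit a downcasting FutureWarning on recent pandas versions. The column name `doctor_match` and the variable `matches` are assumptions introduced for this example.

```python
import pandas as pd

sessions = pd.DataFrame({
    'SID': [1, 2, 3],
    'Doctor': ['robby', 'langdon', 'langdon'],
    'Patient': ['david', 'sara', 'michael'],
})
openers = pd.DataFrame({
    'SID': [1, 1, 2],
    'opener_name': ['robby', 'dana', 'dana'],
})

# One boolean per opened SID: did any opener row match the session's doctor?
matches = (
    openers
    .merge(sessions[['SID', 'Doctor']], on='SID', how='left')
    .assign(doctor_match=lambda d: d['opener_name'] == d['Doctor'])
    .groupby('SID')['doctor_match']
    .any()
)

result = sessions.assign(
    # SIDs missing from `matches` were never opened, so they become False
    is_doctor_opened=(
        matches.reindex(sessions['SID'], fill_value=False)
        .astype(bool)
        .to_numpy()
    ),
    # The index of `matches` holds exactly the SIDs with an opener record
    is_anyone_opened=sessions['SID'].isin(matches.index),
)

print(result)
```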

wjandrea · Accepted Answer · 2025-11-19 13:46:30Z

If I merge the two files on session ID, I will get duplicate rows, and I'm not sure how to [get] rid of the duplicates in that scenario.

For is_doctor_opened, merge on Doctor as well. For is_anyone_opened, de-dupe before merging.

Here I'm going to merge the actual values, then call .notna() afterwards to get the desired boolean result. This technique has the best intermediate values IMHO.

Style note: I like to use chaining, but this chain is pretty lengthy so you might prefer to rewrite it in a more imperative style.

(
    sessions
    # Setup for `is_doctor_opened`
    .merge(
        openers.rename(columns={'opener_name': 'doctor_opener_name'}),
        left_on=['SID', 'Doctor'],
        right_on=['SID', 'doctor_opener_name'],
        how='left',
    )
    # Setup for `is_anyone_opened`
    .merge(
        openers.groupby('SID').agg(list),
        on='SID',
        how='left',
    )
    # Switch to boolean
    .assign(
        is_doctor_opened=lambda d: d['doctor_opener_name'].notna(),
        is_anyone_opened=lambda d: d['opener_name'].notna(),
    )
    .drop(columns=['doctor_opener_name', 'opener_name'])
)

   SID   Doctor  Patient  is_doctor_opened  is_anyone_opened
0    1    robby    david              True              True
1    2  langdon     sara             False              True
2    3  langdon  michael             False             False

Intermediates:

   SID   Doctor  Patient doctor_opener_name    opener_name
0    1    robby    david              robby  [robby, dana]
1    2  langdon     sara                NaN         [dana]
2    3  langdon  michael                NaN            NaN

P.S. I tried a few variations before settling on this one, including setting the "opened" columns before merging. This version is the cleanest and has the best intermediates IMHO.
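If the intermediate values are not needed, the same "merge on Doctor as well" idea can be sketched with the indicator parameter of merge, which adds a `_merge` column recording whether each row found a match. This is a variation under the same sample data, not the code from the answer above; the name `doctor_rows` is made up for this example.

```python
import pandas as pd

sessions = pd.DataFrame({
    'SID': [1, 2, 3],
    'Doctor': ['robby', 'langdon', 'langdon'],
    'Patient': ['david', 'sara', 'michael'],
})
openers = pd.DataFrame({
    'SID': [1, 1, 2],
    'opener_name': ['robby', 'dana', 'dana'],
})

# Rename so both frames share the key columns; de-dupe so a repeated
# (SID, opener) record cannot duplicate a session row.
doctor_rows = (
    openers
    .rename(columns={'opener_name': 'Doctor'})
    .drop_duplicates()
)

result = (
    sessions
    .merge(doctor_rows, on=['SID', 'Doctor'], how='left', indicator=True)
    .assign(
        # 'both' means the (SID, Doctor) pair was found among the openers
        is_doctor_opened=lambda d: d['_merge'] == 'both',
        is_anyone_opened=lambda d: d['SID'].isin(openers['SID']),
    )
    .drop(columns='_merge')
)

print(result)
```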
