Pandas remove duplicates with condition from data frame

Question

Consider the following data frame:

import pandas as pd

df = pd.DataFrame({
    'case_id': [1050, 1050, 1050, 1050, 1051, 1051, 1051, 1051],
    'elm_id': [101, 102, 101, 102, 101, 102, 101, 102],
    'cid': [1, 1, 2, 2, 1, 1, 2, 2],
    'fx': [736.1, 16.5, 98.8, 158.5, 272.5, 750.0, 333.4, 104.2],
    'fy': [992.0, 261.3, 798.3, 452.0, 535.9, 838.8, 526.7, 119.4],
    'fz': [428.4, 611.0, 948.3, 523.9, 880.9, 340.3, 890.7, 422.1]})

When printed, it looks like this:

   case_id  cid  elm_id     fx     fy     fz
0     1050    1     101  736.1  992.0  428.4
1     1050    1     102   16.5  261.3  611.0
2     1050    2     101   98.8  798.3  948.3
3     1050    2     102  158.5  452.0  523.9
4     1051    1     101  272.5  535.9  880.9
5     1051    1     102  750.0  838.8  340.3
6     1051    2     101  333.4  526.7  890.7
7     1051    2     102  104.2  119.4  422.1

I need to remove rows that share the same values in the two columns case_id and elm_id, keeping only the row with the highest cid in each pair. The data should look like this:

   case_id  cid  elm_id     fx     fy     fz
0     1050    2     101   98.8  798.3  948.3
1     1050    2     102  158.5  452.0  523.9
2     1051    2     101  333.4  526.7  890.7
3     1051    2     102  104.2  119.4  422.1

I'm new to pandas. Looking at other similar questions, I tried using .groupby() and max() like this: df2 = df.groupby(['case_id', 'elm_id']).max()['cid'].reset_index(). However, that drops my fx, fy and fz columns. I feel like I'm close; I just don't know where to look next.

cs95 · Accepted Answer · 2018-05-30 22:49:54Z

1

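You'll need sort_values + drop_duplicates. Sorting by cid in descending order puts the highest cid first within each (case_id, elm_id) pair, and drop_duplicates keeps the first occurrence by default: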

df.sort_values('cid', ascending=False).drop_duplicates(['case_id', 'elm_id'])

   case_id  cid  elm_id     fx     fy     fz
2     1050    2     101   98.8  798.3  948.3
3     1050    2     102  158.5  452.0  523.9
6     1051    2     101  333.4  526.7  890.7
7     1051    2     102  104.2  119.4  422.1
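
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same sort_values + drop_duplicates approach using the data from the question. The trailing sort_values and reset_index calls are an addition, not part of the one-liner above; they only restore the row order and the 0..3 index shown in the question's desired output. The name result is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    'case_id': [1050, 1050, 1050, 1050, 1051, 1051, 1051, 1051],
    'elm_id': [101, 102, 101, 102, 101, 102, 101, 102],
    'cid': [1, 1, 2, 2, 1, 1, 2, 2],
    'fx': [736.1, 16.5, 98.8, 158.5, 272.5, 750.0, 333.4, 104.2],
    'fy': [992.0, 261.3, 798.3, 452.0, 535.9, 838.8, 526.7, 119.4],
    'fz': [428.4, 611.0, 948.3, 523.9, 880.9, 340.3, 890.7, 422.1]})

result = (
    df.sort_values('cid', ascending=False)       # highest cid first
      .drop_duplicates(['case_id', 'elm_id'])    # keep first row per pair (default keep='first')
      .sort_values(['case_id', 'elm_id'])        # addition: restore a readable row order
      .reset_index(drop=True)                    # addition: renumber the index from 0
)
print(result)
```

If the original index labels (2, 3, 6, 7) matter, the last two calls can simply be left off.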

4 Comments

twegner Over a year ago

This looks promising. However, I'm using pandas 0.15.2 which does not have method sort_values. I'm not sure if I can upgrade my pandas version behind my work firewall. Looking into it now...

cs95 Over a year ago

@twegner try replacing sort_values by df.sort('cid', axis=1, ascending=False)?

twegner Over a year ago

That raised ValueError('When sorting by column, axis must be 0 (rows)'). Changing to axis=0 made it work. Thanks!

cs95 Over a year ago

@twegner That is just so dumb. Now I know why they dropped the API, lol.

Ivanovitch · Answer · 2018-05-30 22:58:39Z

0

Another way to do this:

df[(df.duplicated(subset=['case_id', 'elm_id'])) & (df['cid'] > 1)]

   case_id  cid  elm_id     fx     fy     fz
2     1050    2     101   98.8  798.3  948.3
3     1050    2     102  158.5  452.0  523.9
6     1051    2     101  333.4  526.7  890.7
7     1051    2     102  104.2  119.4  422.1

1 Comment

cs95 Over a year ago

Only works because cid takes two values here: 1 and 2. Not a good approach in general.
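
To illustrate that caveat, here is a small sketch with hypothetical data (df3 and its values are made up for this example, not taken from the question) where one (case_id, elm_id) pair has three cid values. The duplicated-plus-threshold mask, written here against the case_id column, keeps two rows instead of one. A groupby with idxmax is shown as one possible general alternative; it assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical data: a single (case_id, elm_id) pair with cid values 1, 2 and 3.
df3 = pd.DataFrame({
    'case_id': [1050, 1050, 1050],
    'elm_id': [101, 101, 101],
    'cid': [1, 2, 3],
    'fx': [736.1, 98.8, 55.0],
})

# Mask approach: keeps every repeated row whose cid is above 1.
masked = df3[df3.duplicated(subset=['case_id', 'elm_id']) & (df3['cid'] > 1)]
print(masked['cid'].tolist())  # [2, 3]

# Possible alternative: look up the index label of the largest cid per group,
# then select those rows. Assumes df3 has a unique index.
best = df3.loc[df3.groupby(['case_id', 'elm_id'])['cid'].idxmax()]
print(best['cid'].tolist())  # [3]
```

The sort_values + drop_duplicates approach from the other answer also handles this case, since it never depends on the actual cid values.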
