Consider the following data frame:

import pandas as pd

df = pd.DataFrame({
    'case_id': [1050, 1050, 1050, 1050, 1051, 1051, 1051, 1051],
    'elm_id': [101, 102, 101, 102, 101, 102, 101, 102],
    'cid': [1, 1, 2, 2, 1, 1, 2, 2],
    'fx': [736.1, 16.5, 98.8, 158.5, 272.5, 750.0, 333.4, 104.2],
    'fy': [992.0, 261.3, 798.3, 452.0, 535.9, 838.8, 526.7, 119.4],
    'fz': [428.4, 611.0, 948.3, 523.9, 880.9, 340.3, 890.7, 422.1]})

When printed, it looks like this:

   case_id  cid  elm_id     fx     fy     fz
0     1050    1     101  736.1  992.0  428.4
1     1050    1     102   16.5  261.3  611.0
2     1050    2     101   98.8  798.3  948.3
3     1050    2     102  158.5  452.0  523.9
4     1051    1     101  272.5  535.9  880.9
5     1051    1     102  750.0  838.8  340.3
6     1051    2     101  333.4  526.7  890.7
7     1051    2     102  104.2  119.4  422.1

I need to remove rows that have duplicate values in the two columns case_id and elm_id, retaining only the row with the highest cid for each pair. The data should look like this:

   case_id  cid  elm_id     fx     fy     fz
0     1050    2     101   98.8  798.3  948.3
1     1050    2     102  158.5  452.0  523.9
2     1051    2     101  333.4  526.7  890.7
3     1051    2     102  104.2  119.4  422.1

I'm new to pandas. Looking at other similar questions, I tried using .groupby() and max() like this: df2 = df.groupby(['case_id', 'elm_id']).max()['cid'].reset_index(). However, that drops my columns fx, fy and fz. I feel like I'm close; I just don't know where to look next.

2 Answers

You'll need sort_values + drop_duplicates:

df.sort_values('cid', ascending=False).drop_duplicates(['case_id', 'elm_id'])

   case_id  cid  elm_id     fx     fy     fz
2     1050    2     101   98.8  798.3  948.3
3     1050    2     102  158.5  452.0  523.9
6     1051    2     101  333.4  526.7  890.7
7     1051    2     102  104.2  119.4  422.1
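
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same sort_values + drop_duplicates idea, using the question's data. It assumes a pandas version that has sort_values (as far as I know, 0.17 or later). The trailing sort_index() and reset_index(drop=True) calls are my additions, not part of the one-liner above; they only restore row order and renumber the index to match the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'case_id': [1050, 1050, 1050, 1050, 1051, 1051, 1051, 1051],
    'elm_id': [101, 102, 101, 102, 101, 102, 101, 102],
    'cid': [1, 1, 2, 2, 1, 1, 2, 2],
    'fx': [736.1, 16.5, 98.8, 158.5, 272.5, 750.0, 333.4, 104.2],
    'fy': [992.0, 261.3, 798.3, 452.0, 535.9, 838.8, 526.7, 119.4],
    'fz': [428.4, 611.0, 948.3, 523.9, 880.9, 340.3, 890.7, 422.1]})

result = (
    # Put the highest cid first, so it is the row drop_duplicates keeps.
    df.sort_values('cid', ascending=False)
      # Keep only the first row seen for each (case_id, elm_id) pair.
      .drop_duplicates(['case_id', 'elm_id'])
      # Optional: restore the original row order, then renumber from 0.
      .sort_index()
      .reset_index(drop=True)
)

print(result)
```

All other columns (fx, fy, fz) survive because whole rows are kept or dropped, rather than aggregated column by column.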

4 Comments

This looks promising. However, I'm using pandas 0.15.2, which does not have the sort_values method. I'm not sure if I can upgrade my pandas version behind my work firewall. Looking into it now...
@twegner try replacing sort_values by df.sort('cid', axis=1, ascending=False)?
That raised ValueError('When sorting by column, axis must be 0 (rows)'). Changing to axis=0 made it work. Thanks!
@twegner That is just so dumb. Now I know why they dropped the API, lol.

Another way to do this:

df[(df.duplicated(subset=['case_id','elm_id']))&(df['cid']>1)]

   case_id  cid  elm_id     fx     fy     fz
2     1050    2     101   98.8  798.3  948.3
3     1050    2     102  158.5  452.0  523.9
6     1051    2     101  333.4  526.7  890.7
7     1051    2     102  104.2  119.4  422.1

1 Comment

Only works because cid takes two values here: 1 and 2. Not a good approach in general.
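
To illustrate the point in that comment, here is one possible sketch that does not depend on which values cid takes: look up the index label of the maximum cid within each (case_id, elm_id) group with groupby(...).idxmax(), then select those rows. This is an alternative I am suggesting, not something from either answer, and I have not checked it against very old pandas releases. If several rows in a group tie for the highest cid, idxmax picks the first of them.

```python
import pandas as pd

df = pd.DataFrame({
    'case_id': [1050, 1050, 1050, 1050, 1051, 1051, 1051, 1051],
    'elm_id': [101, 102, 101, 102, 101, 102, 101, 102],
    'cid': [1, 1, 2, 2, 1, 1, 2, 2],
    'fx': [736.1, 16.5, 98.8, 158.5, 272.5, 750.0, 333.4, 104.2],
    'fy': [992.0, 261.3, 798.3, 452.0, 535.9, 838.8, 526.7, 119.4],
    'fz': [428.4, 611.0, 948.3, 523.9, 880.9, 340.3, 890.7, 422.1]})

# For each (case_id, elm_id) group, find the index label of the row with the largest cid.
idx = df.groupby(['case_id', 'elm_id'])['cid'].idxmax()

# Select those full rows, so fx, fy and fz are kept, and renumber the index.
result = df.loc[idx].reset_index(drop=True)

print(result)
```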
