Pandas Group by multiple columns and select non null last values from non-grouped cols

Question

Hi I have an excel sheet with fields like this

FName, LastName, DOB, BirthPlace, Address, Design

John,  Cash,          21-09-1986, Darwin, , , 
John,  Cash,             , ,   , 22 Howard Springs Darwin, 
John,  Cash,        ,          , 20 Howard Springs Darwin     , Supervisor

So I want to groupby Fname, LastName, DOB, Birth Place but then I want pick only "last" non null fields off of other columns like Address and Desig. Also please note that I won't get all the col values in every row though there can be many updates for the same person. Like in 2nd row above I don't have DOB, BirthPlace, Desig. I can have many such rows with only one col updated each item but names and dob fields present most of the time.

I tried to do it this way

 df2 = input_df.groupby(['Fname','LastName','BirthPlace','DOB']).agg({ 'Address':'last',
                             'Desig':'last' 
                              }).reset_index()

But I get none or nulls in the output.

Expected output for me should be : (address updated and designation set as well to last non null value in those columns)

John,  Cash,21-09-1986, Darwin, 20 Howard Springs Darwin  , Supervisor

EBDS · Accepted Answer · 2021-10-13 07:25:50Z

0

This worked for me.

df.replace('',np.nan).groupby(['FN','LN'], as_index=False).agg(
    {'DOB':lambda x: x.dropna().tail(1),
        'BirthPlace': lambda x: x.dropna().tail(1), 
        'Address': lambda x: x.dropna().tail(1),
        'Desig': lambda x: x.dropna().tail(1) }
)

Pardon the use of tail(1) instead of last(). Don't know why it didn't work. But I think you get the picture. Extend your agg with a lambda function to get rid of the blanks first before doing last.

edited Oct 13, 2021 at 7:25

answered Oct 13, 2021 at 6:39

EBDS

1,8244 gold badges13 silver badges35 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

jezrael · Accepted Answer · 2021-10-14 04:51:54Z

0

If there is possible defined groups by only Fname,LastName solution should be simplify by GroupBy.last for last non NaNs values:

#if necessary replace empty spaces to NaNs
#input_df = input_df.replace('', np.nan)

#if there are spaces or empty strings only
#https://stackoverflow.com/a/47810911/2901002
input_df = input_df.replace(r'^\s+$', np.nan, regex=True)     
df2 = input_df.groupby(['FName','LastName'], as_index=False).last()
print(df2)
  FName LastName         DOB BirthPlace                   Address       Desig
0  John     Cash  21-09-1986     Darwin  20 Howard Springs Darwin  Supervisor

edited Oct 14, 2021 at 4:51

answered Oct 13, 2021 at 6:09

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

2 Comments

Tabber Over a year ago

Doesn't replace.(' ', np.nan) remove rows fully, I mean I will lose other columns that have data in them ?

jezrael Over a year ago

@Tabber - No, it not remove rows, only replace if only spaces to NaN values, so last get last non NaNs.

Collectives™ on Stack Overflow

Pandas Group by multiple columns and select non null last values from non-grouped cols

2 Answers 2

Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related