0

Hi I have an excel sheet with fields like this

FName, LastName, DOB, BirthPlace, Address, Design

John,  Cash,          21-09-1986, Darwin, , , 
John,  Cash,             , ,   , 22 Howard Springs Darwin, 
John,  Cash,        ,          , 20 Howard Springs Darwin     , Supervisor

So I want to groupby Fname, LastName, DOB, Birth Place but then I want pick only "last" non null fields off of other columns like Address and Desig. Also please note that I won't get all the col values in every row though there can be many updates for the same person. Like in 2nd row above I don't have DOB, BirthPlace, Desig. I can have many such rows with only one col updated each item but names and dob fields present most of the time.

I tried to do it this way

 df2 = input_df.groupby(['Fname','LastName','BirthPlace','DOB']).agg({ 'Address':'last',
                             'Desig':'last' 
                              }).reset_index()

But I get none or nulls in the output.

Expected output for me should be : (address updated and designation set as well to last non null value in those columns)

John,  Cash,21-09-1986, Darwin, 20 Howard Springs Darwin  , Supervisor 

2 Answers 2

0

This worked for me.

df.replace('',np.nan).groupby(['FN','LN'], as_index=False).agg(
    {'DOB':lambda x: x.dropna().tail(1),
        'BirthPlace': lambda x: x.dropna().tail(1), 
        'Address': lambda x: x.dropna().tail(1),
        'Desig': lambda x: x.dropna().tail(1) }
)

enter image description here

Pardon the use of tail(1) instead of last(). Don't know why it didn't work. But I think you get the picture. Extend your agg with a lambda function to get rid of the blanks first before doing last.

Sign up to request clarification or add additional context in comments.

Comments

0

If there is possible defined groups by only Fname,LastName solution should be simplify by GroupBy.last for last non NaNs values:

#if necessary replace empty spaces to NaNs
#input_df = input_df.replace('', np.nan)

#if there are spaces or empty strings only
#https://stackoverflow.com/a/47810911/2901002
input_df = input_df.replace(r'^\s+$', np.nan, regex=True)     
df2 = input_df.groupby(['FName','LastName'], as_index=False).last()
print(df2)
  FName LastName         DOB BirthPlace                   Address       Desig
0  John     Cash  21-09-1986     Darwin  20 Howard Springs Darwin  Supervisor

2 Comments

Doesn't replace.(' ', np.nan) remove rows fully, I mean I will lose other columns that have data in them ?
@Tabber - No, it not remove rows, only replace if only spaces to NaN values, so last get last non NaNs.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.