
I have a dataframe with a non-unique GEO_ID, an attribute column FTYPE (one of 6 values) for each GEO_ID, and an associated length for each FTYPE.

df

FID                GEO_ID  FTYPE    Length_km
0    1400000US06001400100    428  3.291467766
1    1400000US06001400100    460  7.566487367
2    1400000US06001401700    460  0.262190266
3    1400000US06001401700    566  10.49899202
4    1400000US06001403300    428  0.138171389
5    1400000US06001403300    558  0.532913513

How do I make 6 new columns for FTYPE (with 1 and 0 indicating whether that row has the FTYPE) and 6 new columns for FTYPE_Length, so that each row has a unique GEO_ID?

I want my new dataframe to have a structure like this (with all 6 FTYPEs):

FID                GEO_ID  FTYPE_428  FTYPE_428_length  FTYPE_460  FTYPE_460_length
0    1400000US06001400100          1       3.291467766          1       7.566487367

So far, what I have tried is doing something like this:

import pandas as pd

fname = "filename.csv"
df = pd.read_csv(fname)
nhd = [334, 336, 420, 428, 460, 558, 566]
df1 = df.loc[df['FTYPE'] == nhd[0]]
df2 = df.loc[df['FTYPE'] == nhd[1]]
df3 = df.loc[df['FTYPE'] == nhd[2]]
df4 = df.loc[df['FTYPE'] == nhd[3]]
df5 = df.loc[df['FTYPE'] == nhd[4]]
df6 = df.loc[df['FTYPE'] == nhd[5]]
df7 = df.loc[df['FTYPE'] == nhd[6]]
df12 = df1.merge(df2, how='left', on='GEO_ID')
df23 = df12.merge(df3, how='left', on='GEO_ID')
df34 = df23.merge(df4, how='left', on='GEO_ID')
df45 = df34.merge(df5, how='left', on='GEO_ID')
df56 = df45.merge(df6, how='left', on='GEO_ID')
df67 = df56.merge(df7, how='left', on='GEO_ID')
cols = [0, 4, 7, 10, 13, 16, 19]
df67.drop(df67.columns[cols], axis=1, inplace=True)
df67.columns = ['GEO_ID', '334', 'len_334', '336', 'len_336', '420', 'len_420',
                '428', 'len_428', '460', 'len_460', '558', 'len_558',
                '566', 'len_566']

But this approach is problematic because it reduces the rows to the ones that have the first two FTYPEs. Is there a way to merge on multiple columns at once?
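For what it's worth, the chain of merges can be collapsed with `functools.reduce`, and an outer merge keeps the GEO_IDs that lack some FTYPEs (a sketch on made-up sample data, not the recommended fix):

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({
    "GEO_ID": ["A", "A", "B"],
    "FTYPE": [428, 460, 460],
    "Length_km": [3.29, 7.57, 0.26],
})

nhd = [428, 460]
# One sub-frame per FTYPE, renaming Length_km so the column names don't clash.
parts = [
    df.loc[df["FTYPE"] == x, ["GEO_ID", "Length_km"]]
      .rename(columns={"Length_km": f"len_{x}"})
    for x in nhd
]
# Outer merge keeps GEO_IDs missing some FTYPEs; a left merge would drop them.
merged = reduce(lambda l, r: l.merge(r, on="GEO_ID", how="outer"), parts)
print(merged)
```

That still does one merge per FTYPE under the hood, which is why the pivot in the answer below scales better.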

It's probably easier to write a for loop that goes over each row and uses a condition to fill in the values, like this:

nhd = [334, 336, 420, 428, 460, 558, 566]
for x in nhd:
    df[str(x)] = None
    df["length_" + str(x)] = None

for geoid in df["GEO_ID"]:
    for x in nhd:
        # Single .loc call so the assignment hits df itself, not a copy.
        df.loc[(df['FTYPE'] == x) & (df['GEO_ID'] == geoid), str(x)] = 1

But this takes too much time, and there is probably a one-liner in pandas to do the same thing.
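The 0/1 indicator part of the loop really is close to a one-liner: one-hot encode FTYPE with `pd.get_dummies`, then collapse to one row per GEO_ID (a sketch on made-up sample data):

```python
import pandas as pd

df = pd.DataFrame({
    "GEO_ID": ["A", "A", "B"],
    "FTYPE": [428, 460, 460],
    "Length_km": [3.29, 7.57, 0.26],
})

# One indicator column per FTYPE value; max() picks up the 1 wherever
# that FTYPE occurred for the GEO_ID.
flags = (pd.get_dummies(df["FTYPE"], prefix="FTYPE")
           .groupby(df["GEO_ID"]).max()
           .astype(int))
print(flags)
```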

Any help on this is appreciated!

Thanks, Solomon

1 Answer

I don't quite see the point of your _length columns: they carry the same information as whether or not the matching value is null, which makes them redundant. They're easy enough to create, though.

While we could cram this into one line if we insisted, what's the point? This is SO, not codegolf. So I might do something like:

import pandas as pd

# Wide layout: one row per GEO_ID, one column per FTYPE, lengths as values.
df = df.pivot(index="GEO_ID", columns="FTYPE", values="Length_km")
df.columns = "FTYPE_" + df.columns.astype(str)

# 0/1 indicator for whether each FTYPE occurred for that GEO_ID.
has_value = df.notnull().astype(int)
has_value.columns += '_length'

final = pd.concat([df, has_value], axis=1).sort_index(axis='columns')

which gives me (using your input data, which only has 5 distinct FTYPEs):

In [49]: final
Out[49]: 
                      FTYPE_334  FTYPE_334_length  FTYPE_428  \
GEO_ID                                                         
1400000US06001400100        NaN                 0   3.291468   
1400000US06001401700        NaN                 0        NaN   
1400000US06001403300        NaN                 0   0.138171   
1400000US06001403400    0.04308                 1        NaN   

                      FTYPE_428_length  FTYPE_460  FTYPE_460_length  \
GEO_ID                                                                
1400000US06001400100                 1   7.566487                 1   
1400000US06001401700                 0   0.262190                 1   
1400000US06001403300                 1        NaN                 0   
1400000US06001403400                 0        NaN                 0   

                      FTYPE_558  FTYPE_558_length  FTYPE_566  FTYPE_566_length  
GEO_ID                                                                          
1400000US06001400100        NaN                 0        NaN                 0  
1400000US06001401700        NaN                 0  10.498992                 1  
1400000US06001403300   0.532914                 1   1.518864                 1  
1400000US06001403400        NaN                 0        NaN                 0  
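One caveat worth noting: `DataFrame.pivot` raises a `ValueError` if any (GEO_ID, FTYPE) pair occurs more than once. If your full data can contain such duplicates, `pivot_table` with an explicit aggregation is the safer variant (a sketch on made-up sample data; summing duplicate lengths is an assumption, pick the aggregation that fits):

```python
import pandas as pd

df = pd.DataFrame({
    "GEO_ID": ["A", "A", "A"],      # duplicate (GEO_ID, FTYPE) pair below
    "FTYPE": [428, 428, 460],
    "Length_km": [1.0, 2.0, 0.5],
})

# pivot() would raise ValueError here; pivot_table aggregates the duplicates.
wide = df.pivot_table(index="GEO_ID", columns="FTYPE",
                      values="Length_km", aggfunc="sum")
wide.columns = "FTYPE_" + wide.columns.astype(str)
print(wide)
```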

1 Comment

Great! Thanks for the quick answer. You're right, the length column can be dropped because the information is redundant.
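Dropping the redundant indicator columns is itself a one-liner with `filter`/`drop` (a sketch against a stand-in for the `final` frame above, with the column names assumed as in the answer):

```python
import numpy as np
import pandas as pd

final = pd.DataFrame(
    {"FTYPE_428": [3.29, np.nan], "FTYPE_428_length": [1, 0],
     "FTYPE_460": [7.57, 0.26], "FTYPE_460_length": [1, 1]},
    index=["A", "B"],
)

# Drop every column containing "_length"; the NaN/non-NaN pattern in the
# remaining columns carries the same information.
trimmed = final.drop(columns=final.filter(like="_length").columns)
print(trimmed.columns.tolist())
```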
