
I have a dataframe with a non-unique GEO_ID, an attribute column FTYPE (one of 6 values) for each GEO_ID, and an associated length for each FTYPE.

df

FID                GEO_ID  FTYPE    Length_km
0    1400000US06001400100    428  3.291467766
1    1400000US06001400100    460  7.566487367
2    1400000US06001401700    460  0.262190266
3    1400000US06001401700    566  10.49899202
4    1400000US06001403300    428  0.138171389
5    1400000US06001403300    558  0.532913513

How do I make 6 new columns for FTYPE (with 1 and 0 indicating whether that row has the FTYPE) and 6 new columns for FTYPE_Length, so that each row has a unique GEO_ID?

I want my new dataframe to have a structure like this (with all 6 FTYPEs):

FID                GEO_ID  FTYPE_428  FTYPE_428_length  FTYPE_460  FTYPE_460_length
0    1400000US06001400100          1       3.291467766          1       7.566487367

So far, what I have tried is doing something like this:

import pandas as pd

fname = "filename.csv"
df = pd.read_csv(fname)
nhd = [334, 336, 420, 428, 460, 558, 566]
df1 = df.loc[df['FTYPE'] == nhd[0]]
df2 = df.loc[df['FTYPE'] == nhd[1]]
df3 = df.loc[df['FTYPE'] == nhd[2]]
df4 = df.loc[df['FTYPE'] == nhd[3]]
df5 = df.loc[df['FTYPE'] == nhd[4]]
df6 = df.loc[df['FTYPE'] == nhd[5]]
df7 = df.loc[df['FTYPE'] == nhd[6]]
df12 = df1.merge(df2, how='left', on='GEO_ID')
df23 = df12.merge(df3, how='left', on='GEO_ID')
df34 = df23.merge(df4, how='left', on='GEO_ID')
df45 = df34.merge(df5, how='left', on='GEO_ID')
df56 = df45.merge(df6, how='left', on='GEO_ID')
df67 = df56.merge(df7, how='left', on='GEO_ID')
cols = [0, 4, 7, 10, 13, 16, 19]
df67.drop(df67.columns[cols], axis=1, inplace=True)
df67.columns = ['GEO_ID', '334', 'len_334', '336', 'len_336', '420', 'len_420',
                '428', 'len_428', '460', 'len_460', '558', 'len_558',
                '566', 'len_566']

But this approach is problematic because it reduces the rows to the ones that have the first two FTYPEs. Is there a way to merge on multiple columns at once?
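For what it's worth, the chain of merges can be collapsed with `functools.reduce`, and an outer merge keeps the GEO_IDs that lack some FTYPEs (a sketch on made-up sample data, not the recommended fix):

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({
    "GEO_ID": ["A", "A", "B"],
    "FTYPE": [428, 460, 460],
    "Length_km": [3.29, 7.57, 0.26],
})

nhd = [428, 460]
# One sub-frame per FTYPE, renaming Length_km so the column names don't clash.
parts = [
    df.loc[df["FTYPE"] == x, ["GEO_ID", "Length_km"]]
      .rename(columns={"Length_km": f"len_{x}"})
    for x in nhd
]
# Outer merge keeps GEO_IDs missing some FTYPEs; a left merge would drop them.
merged = reduce(lambda l, r: l.merge(r, on="GEO_ID", how="outer"), parts)
print(merged)
```

That still does one merge per FTYPE under the hood, which is why the pivot in the answer below scales better.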

It's probably easier to write a for loop that goes over each row and uses a condition to fill in the values, like this:

nhd = [334, 336, 420, 428, 460, 558, 566]
for x in nhd:
    df[str(x)] = None
    df["length_" + str(x)] = None

for geoid in df["GEO_ID"]:
    for x in nhd:
        # Single .loc call so the assignment hits df itself, not a copy.
        df.loc[(df['FTYPE'] == x) & (df['GEO_ID'] == geoid), str(x)] = 1

But this takes too much time, and there is probably a one-liner in pandas to do the same thing.
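The 0/1 indicator part of the loop really is close to a one-liner: one-hot encode FTYPE with `pd.get_dummies`, then collapse to one row per GEO_ID (a sketch on made-up sample data):

```python
import pandas as pd

df = pd.DataFrame({
    "GEO_ID": ["A", "A", "B"],
    "FTYPE": [428, 460, 460],
    "Length_km": [3.29, 7.57, 0.26],
})

# One indicator column per FTYPE value; max() picks up the 1 wherever
# that FTYPE occurred for the GEO_ID.
flags = (pd.get_dummies(df["FTYPE"], prefix="FTYPE")
           .groupby(df["GEO_ID"]).max()
           .astype(int))
print(flags)
```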

Any help on this is appreciated!

Thanks, Solomon

1 Answer

I don't quite see the point of your _length columns: they carry the same information as whether or not the matching value is null, which makes them redundant. They're easy enough to create, though.

While we could cram this into one line if we insisted, what's the point? This is SO, not codegolf. So I might do something like:

import pandas as pd

# Wide layout: one row per GEO_ID, one column per FTYPE, lengths as values.
df = df.pivot(index="GEO_ID", columns="FTYPE", values="Length_km")
df.columns = "FTYPE_" + df.columns.astype(str)

# 0/1 indicator for whether each FTYPE occurred for that GEO_ID.
has_value = df.notnull().astype(int)
has_value.columns += '_length'

final = pd.concat([df, has_value], axis=1).sort_index(axis='columns')

which gives me (using your input data, which only has 5 distinct FTYPEs):

In [49]: final
Out[49]: 
                      FTYPE_334  FTYPE_334_length  FTYPE_428  \
GEO_ID                                                         
1400000US06001400100        NaN                 0   3.291468   
1400000US06001401700        NaN                 0        NaN   
1400000US06001403300        NaN                 0   0.138171   
1400000US06001403400    0.04308                 1        NaN   

                      FTYPE_428_length  FTYPE_460  FTYPE_460_length  \
GEO_ID                                                                
1400000US06001400100                 1   7.566487                 1   
1400000US06001401700                 0   0.262190                 1   
1400000US06001403300                 1        NaN                 0   
1400000US06001403400                 0        NaN                 0   

                      FTYPE_558  FTYPE_558_length  FTYPE_566  FTYPE_566_length  
GEO_ID                                                                          
1400000US06001400100        NaN                 0        NaN                 0  
1400000US06001401700        NaN                 0  10.498992                 1  
1400000US06001403300   0.532914                 1   1.518864                 1  
1400000US06001403400        NaN                 0        NaN                 0  
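One caveat worth noting: `DataFrame.pivot` raises a `ValueError` if any (GEO_ID, FTYPE) pair occurs more than once. If your full data can contain such duplicates, `pivot_table` with an explicit aggregation is the safer variant (a sketch on made-up sample data; summing duplicate lengths is an assumption, pick the aggregation that fits):

```python
import pandas as pd

df = pd.DataFrame({
    "GEO_ID": ["A", "A", "A"],      # duplicate (GEO_ID, FTYPE) pair below
    "FTYPE": [428, 428, 460],
    "Length_km": [1.0, 2.0, 0.5],
})

# pivot() would raise ValueError here; pivot_table aggregates the duplicates.
wide = df.pivot_table(index="GEO_ID", columns="FTYPE",
                      values="Length_km", aggfunc="sum")
wide.columns = "FTYPE_" + wide.columns.astype(str)
print(wide)
```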

1 Comment

Great! Thanks for the quick answer. You're right, the length column can be dropped because the information is redundant.
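Dropping the redundant indicator columns is itself a one-liner with `filter`/`drop` (a sketch against a stand-in for the `final` frame above, with the column names assumed as in the answer):

```python
import numpy as np
import pandas as pd

final = pd.DataFrame(
    {"FTYPE_428": [3.29, np.nan], "FTYPE_428_length": [1, 0],
     "FTYPE_460": [7.57, 0.26], "FTYPE_460_length": [1, 1]},
    index=["A", "B"],
)

# Drop every column containing "_length"; the NaN/non-NaN pattern in the
# remaining columns carries the same information.
trimmed = final.drop(columns=final.filter(like="_length").columns)
print(trimmed.columns.tolist())
```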
