1

I am attempting to build a python dataframe to capture an initial dataset, and then update fields as csv files containing data become available in a specified folder location. I have built my initial table and saved it to a csv, but when I reopen the table and try to bring in the next text file data I am accidentally duplicating the headers for the new data everytime I run the code. What is the best way to read in the new text files and append the data to a row based on a common column name?

My current attempt is to build the new text files into their own database, match the column headers from the initial import, and merge on the common header with the starting dataset. However, like I said the columns just keep duplicating themselves and adding _x, _y etc. afterwards.

Sample of Initial Dataframe...

Line Name,Bearing,SOL X,SOL Y,EOL X,EOL Y,FSP 0.5m,FSP 1m,Length km,Length m,LSP 0.5m,LSP 1m,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
A901,38.67269998,568453.03,4343701.73,569156.01,4344580.05,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,
A902,38.67269998,568476.45,4343682.99,569179.43,4344561.31,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,
A903,38.67269998,568499.87,4343664.24,569202.85,4344542.56,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,

Sample of Data to be Merged...

,Line Name,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
0,A901,9/26/2019,269,37,0037_JD269_X331 - 0001.db,09:29:54,2.73,12792,128.67,1000.0,587985.95,4380278.68,587811.22,4380427.4,09:43:51,15594,3.28,588766.68,4379653.81,588901.53,4379545.64
1,A902,9/26/2019,269,38,0038_JD269_M104 - 0001.db,11:38:24,3.69,98260,218.67,42875.0,591391.62,4383593.91,591626.99,4383892.87,17:40:02,25764,3.02,564600.29,4350120.19,568980.53,4355589.75
2,A903,9/29/2019,273,80,0080_JD273_M305 - 0001.db,00:50:53,3.64,27721,38.67,1125.0,576455.88,4351038.29,576365.15,4350932.06,01:03:26,30615,3.78,577158.86,4351916.61,577275.9,4352057.35

I've tried merging, appending and concatenating with varying combinations of rules applied, but without any luck. Some error, others duplicate the columns...

df_proj = pd.concat([df_in,df_acc], axis=1,sort=False)

or

df_proj = pd.merge(df_in, df_acc, on='Line Name', how = 'right')

I've tried to find examples of this type of problem without luck, I be going about this the wrong way. Maybe I shouldn't be treating the new text files as their own dataframe to begin with. But any help would be greatly appreciated. Thank you!

EDIT:

The expected result would look like this...

Line Name,Bearing,SOL X,SOL Y,EOL X,EOL Y,FSP 0.5m,FSP 1m,Length km,Length m,LSP 0.5m,LSP 1m,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
A901,38.67269998,568453.03,4343701.73,569156.01,4344580.05,250,125,1.125,1125,2500,1250,9/26/2019,269,37,0037_JD269_X331 - 0001.db,09:29:54,2.73,12792,128.67,1000.0,587985.95,4380278.68,587811.22,4380427.4,09:43:51,15594,3.28,588766.68,4379653.81,588901.53,4379545.64
A902,38.67269998,568476.45,4343682.99,569179.43,4344561.31,250,125,1.125,1125,2500,1250,9/26/2019,269,38,0038_JD269_M104 - 0001.db,11:38:24,3.69,98260,218.67,42875.0,591391.62,4383593.91,591626.99,4383892.87,17:40:02,25764,3.02,564600.29,4350120.19,568980.53,4355589.75
A903,38.67269998,568499.87,4343664.24,569202.85,4344542.56,250,125,1.125,1125,2500,1250,9/29/2019,273,80,0080_JD273_M305 - 0001.db,00:50:53,3.64,27721,38.67,1125.0,576455.88,4351038.29,576365.15,4350932.06,01:03:26,30615,3.78,577158.86,4351916.61,577275.9,4352057.35
2
  • Can you add expected output after processing ? Commented Oct 8, 2019 at 6:37
  • 1
    @jezrael edits added! Thanks! Commented Oct 8, 2019 at 7:19

3 Answers 3

1

Solution if need replace missing values by another DataFrame:

df = (df_in.set_index('Line Name')
           .combine_first(df_acc.set_index('Line Name'))
           .reset_index()
           .reindex(columns=df_in.columns))
print (df)
  Line Name  Bearing      SOL X       SOL Y      EOL X       EOL Y  FSP 0.5m  \
0      A901  38.6727  568453.03  4343701.73  569156.01  4344580.05       250   
1      A902  38.6727  568476.45  4343682.99  569179.43  4344561.31       250   
2      A903  38.6727  568499.87  4343664.24  569202.85  4344542.56       250   

   FSP 1m  Length km  Length m  ...  SOL Northing  Obs SOL X   Obs SOL Y  \
0     125      1.125      1125  ...    4380278.68  587811.22  4380427.40   
1     125      1.125      1125  ...    4383593.91  591626.99  4383892.87   
2     125      1.125      1125  ...    4351038.29  576365.15  4350932.06   

   Time End  Fix End SOG End Planned EOL X  Planned EOL Y  Obs EOL X  \
0  09:43:51  15594.0    3.28     588766.68     4379653.81  588901.53   
1  17:40:02  25764.0    3.02     564600.29     4350120.19  568980.53   
2  01:03:26  30615.0    3.78     577158.86     4351916.61  577275.90   

    Obs EOL Y  
0  4379545.64  
1  4355589.75  
2  4352057.35  

[3 rows x 32 columns]

If need concat together by Line Name with remove only NaNs columns:

df_proj = (pd.concat([df_in.set_index('Line Name').dropna(how='all', axis=1),
                     df_acc.set_index('Line Name')], axis=1, sort=False)
             .reset_index())
print (df_proj)
  Line Name  Bearing      SOL X       SOL Y      EOL X       EOL Y  FSP 0.5m  \
0      A901  38.6727  568453.03  4343701.73  569156.01  4344580.05       250   
1      A902  38.6727  568476.45  4343682.99  569179.43  4344561.31       250   
2      A903  38.6727  568499.87  4343664.24  569202.85  4344542.56       250   

   FSP 1m  Length km  Length m  ...  SOL Northing  Obs SOL X   Obs SOL Y  \
0     125      1.125      1125  ...    4380278.68  587811.22  4380427.40   
1     125      1.125      1125  ...    4383593.91  591626.99  4383892.87   
2     125      1.125      1125  ...    4351038.29  576365.15  4350932.06   

   Time End  Fix End SOG End Planned EOL X  Planned EOL Y  Obs EOL X  \
0  09:43:51    15594    3.28     588766.68     4379653.81  588901.53   
1  17:40:02    25764    3.02     564600.29     4350120.19  568980.53   
2  01:03:26    30615    3.78     577158.86     4351916.61  577275.90   

    Obs EOL Y  
0  4379545.64  
1  4355589.75  
2  4352057.35  

[3 rows x 32 columns]

EDIT:

For check duplicates:

print (df_in[df_in['Line Name'].duplicated(keep=False)])
print (df_acc[df_acc['Line Name'].duplicated(keep=False)])

For remove duplicates by Line Name use:

df_in = df_in.drop_duplicates('Line Name')
df_acc = df_acc.drop_duplicates('Line Name')
Sign up to request clarification or add additional context in comments.

6 Comments

Hi @jezrael, thank you for the feedback! The result you posted looks great. However I'm having a strange ValueError when I apply to the full size dataframe... ValueError: Shape of passed values is (220, 52), indices imply (219, 52) I'll definitely use this as a basis and see if I can isolate the problem but if you have any suggestions I'd appreciate it! Cheers.
@DennisWilson - one idea - are Line Name unique for both DataFrames ?
@DennisWilson - Because I think error means there is one dupe value.
@DennisWilson - test it by print (df_in[df_in['Line Name'].duplicated(keep=False)]) and print (df_acc[df_acc['Line Name'].duplicated(keep=False)])
Yes you were right, the code worked great! Thank you for the extra edits to help me troubleshoot.
|
1

lets say your starting dataframe as df_in

columns = df_in.columns
df_proj = pd.concat([df_in,df_acc[columns]], axis=0,sort=False)

2 Comments

Hi Rajith! Thanks for the suggestion. I'm getting an error using this approach.... KeyError: "['SOL X', 'LSP 0.5m', 'EOL Y', 'EOL X', 'FSP 0.5m', 'Bearing', 'SOL Y', 'Length km', 'Distance Sailed', 'LSP 1m', 'Length m', 'FSP 1m'] not in index" Any thoughts?
This because second dataframe doesnt have all the columns which are in main dataframe.you want to merge two dataframes with nan values?
0

You could do something like following:

#using append 
final_df = df_in.append(df_acc)

OR

use axis=0, meaning row wise

df_proj = pd.concat([df_in,df_acc], axis=0,sort=False)

1 Comment

Hey @dexter thanks for the reply! This gets the dataframes together, but doesn't merge on the line name. The idea is have a single row that contains the in data (planned line data) and recorded data (df_acc) together based on the 'Line Name' column.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.