Merging Dataframe1 rows into Dataframe2 using existing column headers - Python Pandas

Question

I am attempting to build a python dataframe to capture an initial dataset, and then update fields as csv files containing data become available in a specified folder location. I have built my initial table and saved it to a csv, but when I reopen the table and try to bring in the next text file data I am accidentally duplicating the headers for the new data everytime I run the code. What is the best way to read in the new text files and append the data to a row based on a common column name?

My current attempt is to build the new text files into their own database, match the column headers from the initial import, and merge on the common header with the starting dataset. However, like I said the columns just keep duplicating themselves and adding _x, _y etc. afterwards.

Sample of Initial Dataframe...

Line Name,Bearing,SOL X,SOL Y,EOL X,EOL Y,FSP 0.5m,FSP 1m,Length km,Length m,LSP 0.5m,LSP 1m,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
A901,38.67269998,568453.03,4343701.73,569156.01,4344580.05,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,
A902,38.67269998,568476.45,4343682.99,569179.43,4344561.31,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,
A903,38.67269998,568499.87,4343664.24,569202.85,4344542.56,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,

Sample of Data to be Merged...

,Line Name,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
0,A901,9/26/2019,269,37,0037_JD269_X331 - 0001.db,09:29:54,2.73,12792,128.67,1000.0,587985.95,4380278.68,587811.22,4380427.4,09:43:51,15594,3.28,588766.68,4379653.81,588901.53,4379545.64
1,A902,9/26/2019,269,38,0038_JD269_M104 - 0001.db,11:38:24,3.69,98260,218.67,42875.0,591391.62,4383593.91,591626.99,4383892.87,17:40:02,25764,3.02,564600.29,4350120.19,568980.53,4355589.75
2,A903,9/29/2019,273,80,0080_JD273_M305 - 0001.db,00:50:53,3.64,27721,38.67,1125.0,576455.88,4351038.29,576365.15,4350932.06,01:03:26,30615,3.78,577158.86,4351916.61,577275.9,4352057.35

I've tried merging, appending and concatenating with varying combinations of rules applied, but without any luck. Some error, others duplicate the columns...

df_proj = pd.concat([df_in,df_acc], axis=1,sort=False)

or

df_proj = pd.merge(df_in, df_acc, on='Line Name', how = 'right')

I've tried to find examples of this type of problem without luck, I be going about this the wrong way. Maybe I shouldn't be treating the new text files as their own dataframe to begin with. But any help would be greatly appreciated. Thank you!

EDIT:

The expected result would look like this...

Line Name,Bearing,SOL X,SOL Y,EOL X,EOL Y,FSP 0.5m,FSP 1m,Length km,Length m,LSP 0.5m,LSP 1m,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
A901,38.67269998,568453.03,4343701.73,569156.01,4344580.05,250,125,1.125,1125,2500,1250,9/26/2019,269,37,0037_JD269_X331 - 0001.db,09:29:54,2.73,12792,128.67,1000.0,587985.95,4380278.68,587811.22,4380427.4,09:43:51,15594,3.28,588766.68,4379653.81,588901.53,4379545.64
A902,38.67269998,568476.45,4343682.99,569179.43,4344561.31,250,125,1.125,1125,2500,1250,9/26/2019,269,38,0038_JD269_M104 - 0001.db,11:38:24,3.69,98260,218.67,42875.0,591391.62,4383593.91,591626.99,4383892.87,17:40:02,25764,3.02,564600.29,4350120.19,568980.53,4355589.75
A903,38.67269998,568499.87,4343664.24,569202.85,4344542.56,250,125,1.125,1125,2500,1250,9/29/2019,273,80,0080_JD273_M305 - 0001.db,00:50:53,3.64,27721,38.67,1125.0,576455.88,4351038.29,576365.15,4350932.06,01:03:26,30615,3.78,577158.86,4351916.61,577275.9,4352057.35

Can you add expected output after processing ?

jezrael
– jezrael

2019-10-08 06:37:05 +00:00
Commented Oct 8, 2019 at 6:37 — jezrael
– jezrael, Commented Oct 8, 2019 at 6:37
@jezrael edits added! Thanks!

dwilson
– dwilson

2019-10-08 07:19:30 +00:00
Commented Oct 8, 2019 at 7:19 — dwilson
– dwilson, Commented Oct 8, 2019 at 7:19

jezrael · Accepted Answer · 2019-10-08 07:31:06Z

1

Solution if need replace missing values by another DataFrame:

df = (df_in.set_index('Line Name')
           .combine_first(df_acc.set_index('Line Name'))
           .reset_index()
           .reindex(columns=df_in.columns))
print (df)
  Line Name  Bearing      SOL X       SOL Y      EOL X       EOL Y  FSP 0.5m  \
0      A901  38.6727  568453.03  4343701.73  569156.01  4344580.05       250   
1      A902  38.6727  568476.45  4343682.99  569179.43  4344561.31       250   
2      A903  38.6727  568499.87  4343664.24  569202.85  4344542.56       250   

   FSP 1m  Length km  Length m  ...  SOL Northing  Obs SOL X   Obs SOL Y  \
0     125      1.125      1125  ...    4380278.68  587811.22  4380427.40   
1     125      1.125      1125  ...    4383593.91  591626.99  4383892.87   
2     125      1.125      1125  ...    4351038.29  576365.15  4350932.06   

   Time End  Fix End SOG End Planned EOL X  Planned EOL Y  Obs EOL X  \
0  09:43:51  15594.0    3.28     588766.68     4379653.81  588901.53   
1  17:40:02  25764.0    3.02     564600.29     4350120.19  568980.53   
2  01:03:26  30615.0    3.78     577158.86     4351916.61  577275.90   

    Obs EOL Y  
0  4379545.64  
1  4355589.75  
2  4352057.35  

[3 rows x 32 columns]

If need concat together by Line Name with remove only NaNs columns:

df_proj = (pd.concat([df_in.set_index('Line Name').dropna(how='all', axis=1),
                     df_acc.set_index('Line Name')], axis=1, sort=False)
             .reset_index())
print (df_proj)
  Line Name  Bearing      SOL X       SOL Y      EOL X       EOL Y  FSP 0.5m  \
0      A901  38.6727  568453.03  4343701.73  569156.01  4344580.05       250   
1      A902  38.6727  568476.45  4343682.99  569179.43  4344561.31       250   
2      A903  38.6727  568499.87  4343664.24  569202.85  4344542.56       250   

   FSP 1m  Length km  Length m  ...  SOL Northing  Obs SOL X   Obs SOL Y  \
0     125      1.125      1125  ...    4380278.68  587811.22  4380427.40   
1     125      1.125      1125  ...    4383593.91  591626.99  4383892.87   
2     125      1.125      1125  ...    4351038.29  576365.15  4350932.06   

   Time End  Fix End SOG End Planned EOL X  Planned EOL Y  Obs EOL X  \
0  09:43:51    15594    3.28     588766.68     4379653.81  588901.53   
1  17:40:02    25764    3.02     564600.29     4350120.19  568980.53   
2  01:03:26    30615    3.78     577158.86     4351916.61  577275.90   

    Obs EOL Y  
0  4379545.64  
1  4355589.75  
2  4352057.35  

[3 rows x 32 columns]

EDIT:

For check duplicates:

print (df_in[df_in['Line Name'].duplicated(keep=False)])
print (df_acc[df_acc['Line Name'].duplicated(keep=False)])

For remove duplicates by Line Name use:

df_in = df_in.drop_duplicates('Line Name')
df_acc = df_acc.drop_duplicates('Line Name')

edited Oct 8, 2019 at 7:31

answered Oct 8, 2019 at 6:27

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

dwilson Over a year ago

Hi @jezrael, thank you for the feedback! The result you posted looks great. However I'm having a strange ValueError when I apply to the full size dataframe... ValueError: Shape of passed values is (220, 52), indices imply (219, 52) I'll definitely use this as a basis and see if I can isolate the problem but if you have any suggestions I'd appreciate it! Cheers.

jezrael Over a year ago

@DennisWilson - one idea - are Line Name unique for both DataFrames ?

jezrael Over a year ago

@DennisWilson - Because I think error means there is one dupe value.

jezrael Over a year ago

@DennisWilson - test it by print (df_in[df_in['Line Name'].duplicated(keep=False)]) and print (df_acc[df_acc['Line Name'].duplicated(keep=False)])

dwilson Over a year ago

Yes you were right, the code worked great! Thank you for the extra edits to help me troubleshoot.

|

Rajith Thennakoon · Accepted Answer · 2019-10-08 06:36:02Z

1

lets say your starting dataframe as df_in

columns = df_in.columns
df_proj = pd.concat([df_in,df_acc[columns]], axis=0,sort=False)

answered Oct 8, 2019 at 6:36

Rajith Thennakoon

4,1402 gold badges16 silver badges25 bronze badges

2 Comments

dwilson Over a year ago

Hi Rajith! Thanks for the suggestion. I'm getting an error using this approach.... KeyError: "['SOL X', 'LSP 0.5m', 'EOL Y', 'EOL X', 'FSP 0.5m', 'Bearing', 'SOL Y', 'Length km', 'Distance Sailed', 'LSP 1m', 'Length m', 'FSP 1m'] not in index" Any thoughts?

Rajith Thennakoon Over a year ago

This because second dataframe doesnt have all the columns which are in main dataframe.you want to merge two dataframes with nan values?

Community · Accepted Answer · 2020-06-20 09:12:55Z

0

You could do something like following:

#using append 
final_df = df_in.append(df_acc)

OR

use axis=0, meaning row wise

df_proj = pd.concat([df_in,df_acc], axis=0,sort=False)

edited Jun 20, 2020 at 9:12

CommunityBot

11 silver badge

answered Oct 8, 2019 at 6:32

oreopot

3,4592 gold badges23 silver badges31 bronze badges

1 Comment

dwilson Over a year ago

Hey @dexter thanks for the reply! This gets the dataframes together, but doesn't merge on the line name. The idea is have a single row that contains the in data (planned line data) and recorded data (df_acc) together based on the 'Line Name' column.

Collectives™ on Stack Overflow

Merging Dataframe1 rows into Dataframe2 using existing column headers - Python Pandas

3 Answers 3

6 Comments

2 Comments

OR

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

6 Comments

2 Comments

OR

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related