
I have two Excel files with the following fields:
file1
col1,col2,col3,col4,col5,col6,col7,col8,col9

server1,java_yes,....
server2,java_no,....
server4,java_no,....
server8,java_no,....

file2
col1,col2,col3,col4,col5,col6,col7,col8,col9

server1,java_yes,....
server3,java_no,....
server4,java_yes,....
server8,java_no,....

I want to
a. Iterate over file1
b. Compare each entry in col1 in file1 against col1 in file2
c. If it exists, I want to see if the value in file1->col2 matches the entry in file2->col2
d. If file1->col2 does not match file2->col2 then I want to update file1->col2 to equal file2->col2

Update

I am running into a strange issue and am providing the details here. It works fine for most of the entries, but for some entries it displays NaN even though the DataFrame has java_yes in both places. To figure this out, I added a filter and then printed the value at various stages.
When I print it for df1, df2 and merged, it works fine.
When I print the same at the very end, it displays NaN for certain entries. Very strange.

my_filter = (df1['col1'] == 'server1')
print(df1.loc[my_filter, 'col2'])

All except the last print returns

Yes

The very last print (for df1) returns

NaN
  • You can use pandas.read_excel. Commented Mar 17, 2023 at 22:35
  • Could you provide a minimal example of data for file1-col1 and file2-col2, as well as the expected results, to make sure we understand your question correctly? It looks like you can easily do this using pandas. Commented Mar 17, 2023 at 22:50
  • Added sample values @thmslmr Commented Mar 17, 2023 at 23:03
  • What if a col1 value in file1 does not exist in col1 in file2? Should it be left as it is? Commented Mar 17, 2023 at 23:12
  • @cottontail That is correct; file1 is my reference, so if a matching entry does not exist in file2, leave it as is. Commented Mar 17, 2023 at 23:27

2 Answers

1

You can achieve that using pandas:

First, read the files using pd.read_excel (or pd.read_csv)

import pandas as pd

df1 = pd.read_excel("path/to/file1.xlsx")
df2 = pd.read_excel("path/to/file2.xlsx")

From the example you provided, you should have something like this:

df1

      col1      col2
0  server1  java_yes
1  server2   java_no
2  server4   java_no
3  server8   java_no

df2

      col1      col2
0  server1  java_yes
1  server3   java_no
2  server4  java_yes
3  server8   java_no

Now left-merge df2 into df1 on col1, and overwrite df1["col2"] accordingly:

merged =  df1.merge(df2, on="col1", how="left")
df1['col2'] = merged['col2_y'].fillna(merged['col2_x'])

Resulting df1 is:

      col1      col2
0  server1  java_yes
1  server2   java_no
2  server4  java_yes
3  server8   java_no
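
To try these steps without the Excel files, here is a minimal self-contained sketch. It assumes the two sheets reduce to the sample rows from the question and builds them inline rather than calling pd.read_excel. Selecting only col1 and col2 from df2 is an added choice, not part of the original snippet; it keeps the other shared columns (col3 to col9) from being duplicated with suffixes.

```python
import pandas as pd

# Sample rows from the question, built inline (assumption: stands in for read_excel)
df1 = pd.DataFrame({
    "col1": ["server1", "server2", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_no", "java_no"],
})
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
})

# Left merge keeps every row of df1; col2 from each side gets the _x / _y suffix
merged = df1.merge(df2[["col1", "col2"]], on="col1", how="left")

# Prefer df2's value, fall back to df1's value where there was no match
df1["col2"] = merged["col2_y"].fillna(merged["col2_x"])

print(df1)
```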

EDIT: explaining the merge part

merged =  df1.merge(df2, on="col1", how="left")

This line merges df2 into df1 based on the values in the "col1" column.

how="left" is used to specify that we want to keep all col1 values from df1, even the ones that don't exist in df2. I'll let you check the DataFrame.merge doc for more details.

Columns that have the same name in df1 and df2 (other than the merge key) are renamed with the default suffixes _x and _y.

For the rows where the col1 value does not exist in df2, the columns coming from df2 will be NaN.

Here is what merged looks like:

      col1    col2_x    col2_y
0  server1  java_yes  java_yes
1  server2   java_no       NaN
2  server4   java_no  java_yes
3  server8   java_no   java_no

From here, we want the final col2 in df1 to be:

  • col2_y (i.e. col2 from df2) when it's not NaN (i.e. when the col1 value was in df2),
  • otherwise col2_x (i.e. col2 from df1).

In other words, we want col2_y after replacing all NaN values with the corresponding col2_x value. This is what the fillna statement does.

df1['col2'] = merged['col2_y'].fillna(merged['col2_x'])
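
Regarding the NaN values described in the question update: one possible cause (an assumption, since the failing data is not shown) is index alignment. A merge on a column returns a frame with a fresh 0..n-1 index, and assigning a Series to df1['col2'] matches rows by index label, not by position. If df1 does not have the default index (for example after filtering or dropping rows), labels that are missing from merged end up as NaN. The sketch below reproduces this with a hypothetical non-default index and shows two ways around it.

```python
import pandas as pd

# Hypothetical: df1 carries a non-default index, e.g. left over from earlier filtering
df1 = pd.DataFrame(
    {"col1": ["server1", "server2", "server4", "server8"],
     "col2": ["java_yes", "java_no", "java_no", "java_no"]},
    index=[10, 11, 12, 13],
)
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
})

merged = df1.merge(df2, on="col1", how="left")  # merged is indexed 0..3
new_col2 = merged["col2_y"].fillna(merged["col2_x"])

# Problem: labels 10..13 do not exist in new_col2, so every row becomes NaN
df1_broken = df1.copy()
df1_broken["col2"] = new_col2

# Option 1: assign by position (assumes col1 is unique in df2, so row counts match)
df1_by_position = df1.copy()
df1_by_position["col2"] = new_col2.to_numpy()

# Option 2: look values up by key, which keeps df1's own index throughout
lookup = df2.drop_duplicates("col1").set_index("col1")["col2"]
df1_by_map = df1.copy()
df1_by_map["col2"] = df1["col1"].map(lookup).fillna(df1["col2"])
```

Calling df1 = df1.reset_index(drop=True) before the merge is another way to make the original two lines line up by position.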

5 Comments

thmslmr can you explain how that last line works ? I am struggling with understanding the logic.
@user1074593 I tried to explain a bit more. Tell me if it's ok for you
thmslmr strange issue. My dataset is not too big (2-4 thousand rows and some 15 columns). The issue I am encountering is that for 40 records it shows that field as NaN. When I print out merged['col2_y'] and merged['col2_x'], both say the exact same thing, "Yes". But when I print the df1['col2'] entry, it says NaN. I know this is weird, but I was checking to see if you had some tips on how to troubleshoot.
Hi @user1074593. Can you update your question by adding this failing example (minimal data) please ?
thmslmr added issue as an "update"
0

Assuming that you have a file called workbook.xlsx containing two sheets (i.e. sheet1 and sheet2), you can first load them using code like this:

import numpy as np
import pandas as pd

# Raw strings keep the backslash in the Windows-style relative path literal
df1 = pd.read_excel(r"..\workbook.xlsx", sheet_name="sheet1")
df2 = pd.read_excel(r"..\workbook.xlsx", sheet_name="sheet2")

Now df1 represents the first sheet and df2 represents the second sheet.

You can iterate through df1 on the column named "col1" to check the condition and update df1 using this code:

for i in range(len(df1["col1"])):
    if (df1["col1"][i] == df2["col1"][i]) and (df1["col2"][i] != df2["col2"][i]):
        df1.at[i,"col2"] = df2["col2"][i]

But this only compares values that sit on the same row number in both sheets. If you need to check whether the Sheet1->col1 value exists anywhere in Sheet2->col1, you can use this loop instead:

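for i in range(len(df1["col1"])):
    if df1["col1"][i] in df2["col1"].values:
        # Position of the first row in df2 whose col1 matches
        j = np.where(df2["col1"] == df1["col1"][i])[0][0]
        # Copy a single value, not a Series, into df1
        df1.at[i, "col2"] = df2["col2"].iloc[j]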
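
Because the loop above rescans df2["col1"] for every row of df1, it can get slow on larger sheets. A possible alternative, sketched here under the assumption that each col1 value should map to a single col2 value (if df2 has duplicates, the last one wins), builds a dictionary once and then does one lookup per row. The sample frames are built inline for illustration instead of being read from workbook.xlsx.

```python
import pandas as pd

# Inline stand-ins for the two sheets (sample rows from the question)
df1 = pd.DataFrame({
    "col1": ["server1", "server2", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_no", "java_no"],
})
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
})

# Build the col1 -> col2 lookup once
lookup = dict(zip(df2["col1"], df2["col2"]))

# One dictionary lookup per row; keep df1's own value when there is no match
df1["col2"] = [
    lookup.get(key, current)
    for key, current in zip(df1["col1"], df1["col2"])
]
```

Assigning a plain list is positional, so this does not depend on what index df1 happens to have.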

Finally, to store your result in a new Excel workbook you can use:

with pd.ExcelWriter('New_result.xlsx') as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")

This sets each Sheet1->col2 value to the Sheet2->col2 value wherever the Sheet1->col1 value has a match in Sheet2->col1.

2 Comments

Sheet1->col1 will not always match Sheet2->col1; this is where I am struggling because the only way to do this is to iterate over the entire "Sheet2->col1" for every single entry in "Sheet1->col1" and that takes forever.
Ok I got you now, I've updated the answer to fit your question.
