How to compare two columns in Excel using Python?

Question

I have two Excel files with the following fields:
file1
col1,col2,col3,col4,col5,col6,col7,col8,col9

server1,java_yes,....
server2,java_no,....
server4,java_no,....
server8,java_no,....

file2
col1,col2,col3,col4,col5,col6,col7,col8,col9

server1,java_yes,....
server3,java_no,....
server4,java_yes,....
server8,java_no,....

I want to
a. Iterate over file1
b. Compare each entry in col1 in file1 against col1 in file2
c. If it exists, I want to see if the value in file1->col2 matches the entry in file2->col2
d. If file1->col2 does not match file2->col2 then I want to update file1->col2 to equal file2->col2

Update

I am running into a strange issue, so I am providing the details here. It works fine for most entries, but some entries display NaN even though the dataframe has java_yes in both places. To figure this out, I added a filter and printed the value at various stages.
When I print it for df1, df2 and merged, it works fine.
When I print the same thing at the very end, it displays NaN for certain entries. Very strange.

my_filter = (df1['col1'] == 'server1')
print(df1.loc[my_filter, 'col2'])

All except the last print returns

Yes

The very last print (for df1) returns

NaN

Could you provide a minimal example of data for file1-col1 and file2-col2, as well as the expected results, to make sure we understand your question correctly? It looks like you can easily do it using pandas.
– thmslmr, Commented Mar 17, 2023 at 22:50
What if a col1 value in file1 does not exist in col1 in file2? Should it be left as it is?
– cottontail, Commented Mar 17, 2023 at 23:12
@cottontail That is correct; file1 is my reference, so if a matching entry does not exist in file2, leave it as is.
– user1074593, Commented Mar 17, 2023 at 23:27

thmslmr · Accepted Answer · 2023-03-19 10:25:15Z

1

You can achieve that using pandas:

First, read the files using pd.read_excel (or pd.read_csv)

import pandas as pd

df1 = pd.read_excel("path/to/file1.xlsx")
df2 = pd.read_excel("path/to/file2.xlsx")

From the example you provided, you should have something like this:

df1

	col1	col2
0	server1	java_yes
1	server2	java_no
2	server4	java_no
3	server8	java_no

df2

	col1	col2
0	server1	java_yes
1	server3	java_no
2	server4	java_yes
3	server8	java_no
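To follow along without the Excel files, the two frames above can be built directly in memory. This is only a sketch: it copies the sample rows from the question and assumes col1 and col2 are the only columns that matter here.

```python
import pandas as pd

# Sample rows copied from the question; only col1 and col2 are included.
df1 = pd.DataFrame({
    "col1": ["server1", "server2", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_no", "java_no"],
})
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
})

print(df1)
print(df2)
```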

Now merge df2 into df1 on col1 in left mode, and overwrite df1["col2"] accordingly:

merged = df1.merge(df2, on="col1", how="left")
df1['col2'] = merged['col2_y'].fillna(merged['col2_x'])

Resulting df1 is:

	col1	col2
0	server1	java_yes
1	server2	java_no
2	server4	java_yes
3	server8	java_no

EDIT: explaining the merge part

merged = df1.merge(df2, on="col1", how="left")

This line merges df2 into df1 based on the values in the "col1" column.

how="left" specifies that we want to keep all col1 values from df1, even the ones that don't exist in df2. See the DataFrame.merge documentation for more details.

Columns that have the same name in df1 and df2 are renamed with the default suffixes _x and _y.

For rows where the col1 value does not exist in df2, the values in the columns coming from df2 will be NaN.

Here is what merged looks like:

	col1	col2_x	col2_y
0	server1	java_yes	java_yes
1	server2	java_no	NaN
2	server4	java_no	java_yes
3	server8	java_no	java_no
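If it helps to see which rows of df1 found a partner in df2, DataFrame.merge accepts indicator=True, which adds a _merge column. The sketch below reuses the sample data from the question (rebuilt in memory, which is an assumption about your real files) and lists the col1 values that were not found.

```python
import pandas as pd

# Sample rows from the question, rebuilt in memory for illustration.
df1 = pd.DataFrame({
    "col1": ["server1", "server2", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_no", "java_no"],
})
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
})

# indicator=True adds a "_merge" column: "both" when col1 was found in df2,
# "left_only" when it exists only in df1.
merged = df1.merge(df2, on="col1", how="left", indicator=True)

# col1 values from df1 that have no counterpart in df2.
unmatched = merged.loc[merged["_merge"] == "left_only", "col1"].tolist()

print(merged)
print(unmatched)
```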

From here, we want the final col2 in df1 to be:

col2_y (i.e. col2 from df2) when it's not NaN (i.e. when the col1 value was in df2),
otherwise col2_x (i.e. col2 from df1).

In other words, we want col2_y after replacing all NaN values with the corresponding col2_x value. This is what the fillna statement does.

df1['col2'] = merged['col2_y'].fillna(merged['col2_x'])
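One caveat: assigning a Series to df1['col2'] aligns on index labels, and merged always gets a fresh 0..n-1 index. If df1 has a different index (for example after filtering rows), labels with no counterpart end up as NaN even though merged looks correct. This may or may not be what is behind the NaN values described in the question's update; the question does not give enough detail to confirm it. As a sketch of an alternative that keeps df1's own index, Series.map can look up each col1 value in df2. It assumes that keeping the first row is acceptable when col1 is duplicated in df2.

```python
import pandas as pd

# df1 deliberately has a non-default index, as it would after filtering rows.
df1 = pd.DataFrame(
    {
        "col1": ["server1", "server2", "server4", "server8"],
        "col2": ["java_yes", "java_no", "java_no", "java_no"],
    },
    index=[10, 11, 12, 13],
)
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
})

# Lookup from col1 to col2 in df2. Duplicate col1 values are dropped
# (first occurrence kept) so that the lookup index is unique.
lookup = df2.drop_duplicates("col1").set_index("col1")["col2"]

# map() returns a Series with df1's index; keys missing from df2 become NaN
# and fall back to the value already in df1.
df1["col2"] = df1["col1"].map(lookup).fillna(df1["col2"])

print(df1)
```

Calling df1.reset_index(drop=True) before the merge is another way to keep the original two-line version working, as long as col1 is unique in df2.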

edited Mar 19, 2023 at 10:25

answered Mar 17, 2023 at 23:31

thmslmr


5 Comments

user1074593 Over a year ago

thmslmr can you explain how that last line works ? I am struggling with understanding the logic.

thmslmr Over a year ago

@user1074593 I tried to explain a bit more. Tell me if it's ok for you

user1074593 Over a year ago

thmslmr strange issue. My dataset is not too big (2-4 thousand rows and some 15 columns). The issue I am encountering is that for 40 records it shows that field as NaN. When I print out merged['col2_y'] and merged['col2_x'], both say the exact same thing: "Yes". But when I print the df1['col2'] entry, it says NaN. I know this is weird, but I was checking to see if you had some tips on how to troubleshoot.

thmslmr Over a year ago

Hi @user1074593. Can you update your question by adding this failing example (minimal data) please ?

user1074593 Over a year ago

thmslmr added issue as an "update"

Khaled Sayed · Accepted Answer · 2023-03-17 23:49:04Z

0

Assuming that you have a file called workbook.xlsx containing two sheets (i.e. sheet1 and sheet2), you can first load them like this:

import pandas as pd
# Forward slashes avoid the invalid "\w" escape sequence in the path string.
df1 = pd.read_excel("../workbook.xlsx", sheet_name="sheet1")
df2 = pd.read_excel("../workbook.xlsx", sheet_name="sheet2")

Now df1 represents the first sheet and df2 represents the second sheet.

You can iterate through df1 on the column "col1" to check the condition and update the data frame using this code:

for i in range(len(df1["col1"])):
    if (df1["col1"][i] == df2["col1"][i]) and (df1["col2"][i] != df2["col2"][i]):
        df1.at[i,"col2"] = df2["col2"][i]

But this only compares values on the same row number. If you need to check whether the Sheet1->col1 value exists anywhere in the Sheet2->col1 values, you can use this loop instead:

import numpy as np

for i in range(len(df1["col1"])):
    if df1["col1"][i] in df2["col1"].values:
        # Position of the first row in df2 whose col1 matches.
        j = np.where(df2["col1"] == df1["col1"][i])[0][0]
        df1.at[i, "col2"] = df2["col2"].iloc[j]

Finally, to store your result in a new Excel workbook, you can use:

with pd.ExcelWriter('New_result.xlsx') as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")

This makes every Sheet1->col2 value match Sheet2->col2 wherever Sheet1->col1 == Sheet2->col1.
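The second loop above scans df2 once for every row of df1, which can get slow on larger sheets. A possible variation, sketched here on the sample data from the question, builds a dictionary from df2 once so that each lookup is a constant-time operation. The names lookup and changed are illustrative, and if col1 is duplicated in df2 the last occurrence wins.

```python
import pandas as pd

# Sample rows from the question, rebuilt in memory for illustration.
df1 = pd.DataFrame({
    "col1": ["server1", "server2", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_no", "java_no"],
})
df2 = pd.DataFrame({
    "col1": ["server1", "server3", "server4", "server8"],
    "col2": ["java_yes", "java_no", "java_yes", "java_no"],
})

# Build the col1 -> col2 lookup once; for duplicate keys the last row wins.
lookup = dict(zip(df2["col1"], df2["col2"]))

changed = []  # col1 values whose col2 was updated
for i in df1.index:
    key = df1.at[i, "col1"]
    # Update only when the key exists in df2 and the values differ.
    if key in lookup and df1.at[i, "col2"] != lookup[key]:
        df1.at[i, "col2"] = lookup[key]
        changed.append(key)

print(df1)
print(changed)
```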

edited Mar 17, 2023 at 23:49

answered Mar 17, 2023 at 23:06

Khaled Sayed


2 Comments

user1074593 Over a year ago

Sheet1->col1 will not always match Sheet2->col1; this is where I am struggling because the only way to do this is to iterate over the entire "Sheet2->col1" for every single entry in "Sheet1->col1" and that takes forever.

Khaled Sayed Over a year ago

Ok I got you now, I've updated the answer to fit your question.
