Compare two columns from a CSV file using Python

Question

I have a CSV file like:

item1,item2 
A,B
B,C
C,D
E,F

I want to compare these two columns and find the values that appear in both item1 and item2. The output should look like this:

 item 
  B
  C

I have tried this code:

import csv

with open('output/id.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)

    # compares the two values within each row only
    for line in csvreader:
        if line[0] == line[1]:
            print(line)
        else:
            print("not match")

I am new to programming. I don't know what the logic should be or how to solve this problem. Please help.

There's an obvious IndentationError in your code; is that all you're asking about?
– abarnert, Commented Apr 5, 2018 at 8:27
If not: What's wrong with the code you tried? Does it give the wrong output? Does it raise an exception? (If so, copy and paste it.) Does it seem to take way too long? Your code sample is perfect (except for that indentation problem, if that's not in your real code), but a minimal reproducible example usually needs more than just the code.
– abarnert, Commented Apr 5, 2018 at 8:29
I edited my code. Now there is no IndentationError. The output shows "not match", which is not correct. @abarnert
– jan, Commented Apr 5, 2018 at 8:36
Your first problem is that Python indexing is 0-based, not 1-based, so you're actually comparing the second and third columns, not the first and second. You want if line[1] == line[0]:.
– abarnert, Commented Apr 5, 2018 at 8:41
But it's still not going to work, except to find cases where the matching values happen to be in the same row. (As Jean-François Fabre already explained nicely.)
– abarnert, Commented Apr 5, 2018 at 8:42

Ollie · Accepted Answer · 2018-04-05 08:34:48Z

2

You need to:

Use '\t' as your delimiter, as your file is delimited by tabs, not commas
Get all the items from both lists as a set, then get the intersection of the two sets
Print them

Here's my implementation:

import csv
with open('output/id.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='\t')

    items_in_1 = set()
    items_in_2 = set()

    for line in csvreader:
        items_in_1.add(line[0])
        items_in_2.add(line[1])

    items_in_both = items_in_1.intersection(items_in_2)

    print("item")
    for item in items_in_both:
        print(item)
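If it is unclear whether the file is delimited by tabs or commas, the standard library's csv.Sniffer can guess from a sample. The following is a minimal sketch that assumes those are the only two candidates; common_items, comma_text and tab_text are illustrative names, not part of the answer above.

```python
import csv
import io


def common_items(text, candidates=",\t"):
    """Return the sorted values found in both of the first two columns."""
    # Let Sniffer guess the delimiter, restricted to the candidates we expect.
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    reader = csv.reader(io.StringIO(text), dialect)
    next(reader, None)  # skip the header row
    first, second = set(), set()
    for row in reader:
        if len(row) >= 2:  # ignore blank or short lines
            first.add(row[0])
            second.add(row[1])
    return sorted(first & second)


# The sample data from the question, once with commas and once with tabs.
comma_text = "item1,item2\nA,B\nB,C\nC,D\nE,F\n"
tab_text = comma_text.replace(",", "\t")

print(common_items(comma_text))
print(common_items(tab_text))
```

Sniffing is a heuristic, so for a file whose format is known it is simpler to pass the delimiter explicitly, as in the code above.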

answered Apr 5, 2018 at 8:34


JahKnows · Accepted Answer · 2018-04-05 08:31:11Z

2

I would recommend the pandas library. It loads your CSV file into a DataFrame, which is a convenient data structure for this kind of work.

import pandas as pd

df = pd.read_csv(filename)

Then you can get the values common to both columns with

set(df['col1']) & set(df['col2'])

To get the output shaped the way you describe you can then make a new DataFrame with this intersected information as

df2 = pd.DataFrame(data = {'item': list(set(df['col1']) & set(df['col2']))})

For example,

import pandas as pd
d = {'col1': [1, 2, 6, 4, 3], 'col2': [3, 2, 5, 6, 8]}
df = pd.DataFrame(data=d)
set(df['col1']) & set(df['col2'])

{2, 3, 6}
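Applied to the data from the question, the same idea might look like the sketch below. It assumes the columns are named item1 and item2, as in the sample file, and reads from an in-memory string (CSV_TEXT is a stand-in for the real file) so that it runs on its own.

```python
import io

import pandas as pd

# Stand-in for the real file; replace io.StringIO(CSV_TEXT) with a file path.
CSV_TEXT = "item1,item2\nA,B\nB,C\nC,D\nE,F\n"

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Sets are unordered, so sort the intersection for a stable result.
common = sorted(set(df["item1"]) & set(df["item2"]))

df2 = pd.DataFrame({"item": common})
print(df2.to_string(index=False))
```

Sorting is optional; it only makes the order of the rows in df2 predictable.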

answered Apr 5, 2018 at 8:31


Jean-François Fabre · Accepted Answer · 2018-04-05 09:00:31Z

1

You cannot succeed by reading row by row. You have to work on the columns.

Read both columns of your CSV file (without the title row) into two Python sets.

Compute the sorted intersection and write it to another CSV file:

import csv

with open("test.csv") as f:
    cr = csv.reader(f)
    next(cr) # skip title
    col1 = set()
    col2 = set()
    for a,b in cr:
        col1.add(a)
        col2.add(b)

with open("output.csv","w",newline="") as f:
    cw = csv.writer(f)
    cw.writerow(["item"])
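    # wrap each value in a list; a bare string would be split into one-character fields
    cw.writerows([item] for item in sorted(col1 & col2))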

with test.csv as:

item1,item2
A,B
B,C
C,D
E,F

you get

item
B
C

Note: if your CSV file has more than 2 columns, the tuple unpacking raises ValueError (too many values to unpack). Index the row instead:

for row in cr:
    col1.add(row[0])
    col2.add(row[1])
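Putting the indexed loop together with the rest, a self-contained sketch could look like this. The helper name shared_values, the extra "note" column and the in-memory sample are assumptions made for illustration; skipping short rows and stripping whitespace are defensive choices that go beyond the code above.

```python
import csv
import io


def shared_values(lines):
    """Collect the values present in both of the first two columns."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the title row, if there is one
    col1, col2 = set(), set()
    for row in reader:
        if len(row) < 2:
            continue  # blank or short line: nothing to compare
        # Index instead of unpacking so extra columns are ignored.
        col1.add(row[0].strip())
        col2.add(row[1].strip())
    return sorted(col1 & col2)


# Sample with a third column and a blank line, both of which break "for a,b in cr".
sample = io.StringIO("item1,item2,note\nA,B,x\nB,C,y\n\nC,D,z\nE,F,w\n")
print(shared_values(sample))
```

Because csv.reader accepts any iterable of lines, the same function works with an open file object in place of the StringIO sample.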

edited Apr 5, 2018 at 9:00

answered Apr 5, 2018 at 8:31


jan Over a year ago

I am getting this error: for a,b in cr: ValueError: too many values to unpack @Jean-François Fabre
