0

I have a CSV file like:

item1,item2
A,B
B,C
C,D
E,F

I want to compare these two columns and find the values that appear in both item1 and item2. The output should look like this:

 item 
  B
  C

I have tried this code:

import csv

with open('output/id.csv', 'r') as csvfile:
    csvreader = csv.reader(csvfile)

    for line in csvreader:
        if (line[0] == line[1]):
            print(line)
        else:
            print("not match")

I am new to programming. I don't know what the logic should be or how to solve this problem. Please help.

5
  • There's an obvious IndentationError in your code; is that all you're asking about? Commented Apr 5, 2018 at 8:27
  • If not: What's wrong with the code you tried? Does it give the wrong output? Does it raise an exception? (If so, copy and paste it.) Does it seem to take way too long? Your code sample is perfect (except for that indentation problem, if that's not in your real code), but a minimal reproducible example usually needs more than just the code. Commented Apr 5, 2018 at 8:29
  • I edited the code. Now there is no IndentationError. The output shows "not match", which is not correct. @abarnert Commented Apr 5, 2018 at 8:36
  • Your first problem is that Python indexing is 0-based, not 1-based, so you're actually comparing the second and third columns, not the first and second. You want if line[1] == line[0]:. Commented Apr 5, 2018 at 8:41
  • But it's still not going to work, except to find cases where the matching values happen to be in the same row. (As Jean-François Fabre already explained nicely.) Commented Apr 5, 2018 at 8:42

3 Answers

2

You need to:

  1. Use '\t' as your delimiter if your file is delimited by tabs rather than commas (the sample shown in the question uses commas, which is the csv.reader default)
  2. Collect the items from each column into a set, then take the intersection of the two sets
  3. Print them

Here's my implementation:

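import csv

with open('output/id.csv', 'r') as csvfile:
    # delimiter='\t' is for tab-separated files; drop it for comma-separated ones
    csvreader = csv.reader(csvfile, delimiter='\t')
    next(csvreader, None)  # skip the header row

    items_in_1 = set()
    items_in_2 = set()

    # collect the values of each column into its own set
    for line in csvreader:
        items_in_1.add(line[0])
        items_in_2.add(line[1])

    # values present in both columns
    items_in_both = items_in_1.intersection(items_in_2)

    print("item")
    for item in items_in_both:
        print(item)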
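
If you are not sure whether the file uses tabs or commas, one option is to let the standard library guess. The sketch below assumes the delimiter is one of those two, uses csv.Sniffer to detect it, and wraps the steps above in a hypothetical helper, common_items, which is not part of the original answer. It reads from an in-memory string so that it runs on its own.

```python
import csv
import io


def common_items(text):
    """Return the sorted values that appear in both of the first two columns.

    `common_items` is a hypothetical helper for illustration only.
    """
    # Guess the delimiter from the text; only tabs and commas are considered.
    dialect = csv.Sniffer().sniff(text, delimiters=",\t")
    reader = csv.reader(io.StringIO(text), dialect)
    next(reader, None)  # skip the header row

    first, second = set(), set()
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or incomplete rows
        first.add(row[0].strip())
        second.add(row[1].strip())
    return sorted(first & second)


# Sample data from the question, once with commas and once with tabs.
comma_sample = "item1,item2\nA,B\nB,C\nC,D\nE,F\n"
tab_sample = comma_sample.replace(",", "\t")

print(common_items(comma_sample))
print(common_items(tab_sample))
```

For a real file you would pass the file's contents (or a leading chunk of it) to sniff instead of a literal string. Sniffing is a heuristic, so passing the delimiter explicitly is safer when you already know it.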

2

I would recommend the pandas library. It loads your CSV file into a DataFrame, which is a convenient data structure for this kind of column-wise work.

import pandas as pd

df = pd.read_csv(filename)

Then you can get the values common to both columns (named col1 and col2 here; in your file they are item1 and item2) with

set(df['col1']) & set(df['col2'])

To get the output shaped the way you describe, you can then build a new DataFrame from the intersection:

df2 = pd.DataFrame(data = {'item': list(set(df['col1']) & set(df['col2']))})

For example,

import pandas as pd
d = {'col1': [1, 2, 6, 4, 3], 'col2': [3, 2, 5, 6, 8]}
df = pd.DataFrame(data=d)
set(df['col1']) & set(df['col2'])

{2, 3, 6}


1

You cannot succeed by reading row by row, because a row-wise comparison only finds matches that happen to sit in the same row. You have to work on the columns.

Read both columns of your CSV file (without the title) into two Python sets.

Take the sorted intersection and write it to another CSV file:

import csv

with open("test.csv") as f:
    cr = csv.reader(f)
    next(cr) # skip title
    col1 = set()
    col2 = set()
    for a,b in cr:
        col1.add(a)
        col2.add(b)

with open("output.csv","w",newline="") as f:
    cw = csv.writer(f)
    cw.writerow(["item"])
    # wrap each value in a list so a multi-character value stays in one cell
    cw.writerows([item] for item in sorted(col1 & col2))

with test.csv as:

item1,item2
A,B
B,C
C,D
E,F

you get

item
B
C

Note: if your CSV file has more than 2 columns, unpacking each row into a,b raises ValueError: too many values to unpack. Adapt the loop like this:

for row in cr:
    col1.add(row[0])
    col2.add(row[1])
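
The index-based loop also makes it easy to guard against rows that are too short, which is the other way the two-name unpack can fail. Here is a minimal sketch, assuming blank or one-column rows should simply be skipped. The sample data and the rows_text name are made up for illustration.

```python
import csv
import io

# Made-up sample: most rows have a third column, one line is blank,
# one row has only two columns, and one has a single value.
rows_text = "item1,item2,note\nA,B,x\nB,C,y\n\nC,D\nE\nE,F,z\n"

col1, col2 = set(), set()
reader = csv.reader(io.StringIO(rows_text))
next(reader)  # skip title
for row in reader:
    if len(row) < 2:
        continue  # blank line or missing second column
    col1.add(row[0])
    col2.add(row[1])

common = sorted(col1 & col2)
print(common)
```

Whether to skip malformed rows or stop with an error depends on how much you trust the input; skipping is only one possible choice.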

1 Comment

I am getting this error: for a,b in cr: ValueError: too many values to unpack @Jean-François Fabre
