3

Im trying to output the difference between 2 csv files by two columns and create a third csv file. How can I make the following code compare by columns 0 and 3.

import csv

f1 = open ("ted.csv")
oldFile1 = csv.reader(f1, delimiter=',')
oldList1 = list(oldFile1)

f2 = open ("ted2.csv")
newFile2 = csv.reader(f2, delimiter=',')
newList2 = list(newFile2)

f1.close()
f2.close()

output1 = set(tuple(row) for row in newList2 if row not in oldList1)
output2 = set(tuple(row) for row in oldList1 if row not in newList2)

with open('Michal_K.csv','w') as csvfile:
      wr = csv.writer(csvfile,delimiter=',')
      for line in (output2).difference(output1):
          wr.writerow(line)   
1
  • 2
    This is the kind of thing pandas was written for. Take a look at that library! Commented Jul 19, 2015 at 18:36

1 Answer 1

2

If you want the rows from ted.csv that do not have any of the same third and fourth column elements as ted2, create a set of those elements from the ted2 and check each row from ted.csv before writing:

with open("ted.csv") as f1, open("ted2.csv") as f2, open('foo.csv', 'w') as out:
    r1, r2 = csv.reader(f1), csv.reader(f2)
    st = set((row[0], row[3]) for row in r1)
    wr = csv.writer(out)
    for row in (row for row in r2 if (row[0],row[3]) not in st):
          wr.writerow(row)   

If you actually want something like the symmetric difference where you get unique rows from both then make a set of each third and fourth columns from both files :

from itertools import chain

with open("ted.csv") as f1, open("ted2.csv") as f2, open('foo.csv', 'w') as out:
    r1, r2 = csv.reader(f1), csv.reader(f2)
    st1 = set((row[0], row[3]) for row in r1)
    st2 = set((row[0], row[3]) for row in r2)
    f1.seek(0), f2.seek(0)
    wr = csv.writer(out)
    r1, r2 = csv.reader(f1), csv.reader(f2)
    output1 = (row for row in r1 if (row[0], row[3]) not in st2)
    output2 = (row for row in r2 if (row[0], row[3]) not in st1)
    for row in chain.from_iterable((output1, output2)):
        wr.writerow(row)
Sign up to request clarification or add additional context in comments.

7 Comments

Thanks , Im after the rows that dont have the same element in row[0] and row[3] . Will still try the second approach to figure out the difference.
The second approach should give you the symmetric difference based on the first and fourth columns
Second approach gives me a list index out of range.
Then you don't have at least four values in each row, add a link to the data if possible
Had a good look at all the files the first two have 4 columns the third one is an empty file. Could that be the issue File "C:/testcsv/Pandaman.py", line 28, in <genexpr> st = set((row[0], row[3]) for row in r1) IndexError: list index out of range Ive also tried row[2] but same results
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.