Manipulating csv files with Python

Question

I'm trying to output the difference between two CSV files, compared on two columns, and write it to a third CSV file. How can I make the following code compare by columns 0 and 3?

import csv

f1 = open ("ted.csv")
oldFile1 = csv.reader(f1, delimiter=',')
oldList1 = list(oldFile1)

f2 = open ("ted2.csv")
newFile2 = csv.reader(f2, delimiter=',')
newList2 = list(newFile2)

f1.close()
f2.close()

output1 = set(tuple(row) for row in newList2 if row not in oldList1)
output2 = set(tuple(row) for row in oldList1 if row not in newList2)

with open('Michal_K.csv', 'w') as csvfile:
    wr = csv.writer(csvfile, delimiter=',')
    for line in output2.difference(output1):
        wr.writerow(line)

This is the kind of thing pandas was written for. Take a look at that library!
– AZhao, Commented Jul 19, 2015 at 18:36

Padraic Cunningham · Accepted Answer · 2015-07-19 18:47:19Z

If you want the rows from ted.csv whose first and fourth column values (row[0] and row[3]) do not appear together in any row of ted2.csv, build a set of those pairs from ted2.csv and check each row of ted.csv against it before writing:

import csv

with open("ted.csv") as f1, open("ted2.csv") as f2, open("foo.csv", "w") as out:
    r1, r2 = csv.reader(f1), csv.reader(f2)
    # (first, fourth) column pairs seen in ted2.csv
    st = set((row[0], row[3]) for row in r2)
    wr = csv.writer(out)
    # keep only the ted.csv rows whose pair is not in ted2.csv
    for row in r1:
        if (row[0], row[3]) not in st:
            wr.writerow(row)
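To check the behaviour without touching the real files, the same filter can be run on in-memory data. This is only a sketch: the helper name `rows_not_in`, its `key_cols` parameter, and the sample rows are made up for illustration, and it assumes every row has at least four fields.

```python
import csv
import io


def rows_not_in(source, other, key_cols=(0, 3)):
    """Return rows of `source` whose key columns match no row of `other`.

    Hypothetical helper: `source` and `other` are open text files (or any
    iterables of CSV lines); `key_cols` picks the columns to compare on.
    """
    def key(row):
        return tuple(row[i] for i in key_cols)

    # Collect the key pairs present in the file we compare against.
    seen = {key(row) for row in csv.reader(other)}
    # Keep the source rows whose key pair was never seen.
    return [row for row in csv.reader(source) if key(row) not in seen]


# Made-up sample data standing in for ted.csv and ted2.csv.
ted = io.StringIO("1,a,b,x\n2,c,d,y\n3,e,f,z\n")
ted2 = io.StringIO("1,q,r,x\n3,e,f,DIFFERENT\n")

unique_rows = rows_not_in(ted, ted2)
print(unique_rows)
```

Only the key columns decide whether a row is kept, so the first sample row is dropped even though its second and third fields differ between the two inputs.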

If you actually want something like the symmetric difference, where you get the rows unique to each file, build a set of first and fourth column pairs from each file:

import csv
from itertools import chain

with open("ted.csv") as f1, open("ted2.csv") as f2, open("foo.csv", "w") as out:
    r1, r2 = csv.reader(f1), csv.reader(f2)
    st1 = set((row[0], row[3]) for row in r1)
    st2 = set((row[0], row[3]) for row in r2)
    # rewind both files so they can be read a second time
    f1.seek(0)
    f2.seek(0)
    wr = csv.writer(out)
    r1, r2 = csv.reader(f1), csv.reader(f2)
    # rows whose pair appears in only one of the two files
    output1 = (row for row in r1 if (row[0], row[3]) not in st2)
    output2 = (row for row in r2 if (row[0], row[3]) not in st1)
    for row in chain(output1, output2):
        wr.writerow(row)
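If both files fit comfortably in memory, an alternative to rewinding with seek(0) is to read each file into a list once and filter the lists. The following is a sketch under that assumption; the function name `symmetric_difference_rows`, its `key_cols` parameter, and the sample data are hypothetical and not part of the answer above.

```python
import csv
import io
from itertools import chain


def symmetric_difference_rows(file_a, file_b, key_cols=(0, 3)):
    """Rows from either input whose key columns are absent from the other."""
    def key(row):
        return tuple(row[i] for i in key_cols)

    # Read each input exactly once, so no seek(0) is needed later.
    rows_a = list(csv.reader(file_a))
    rows_b = list(csv.reader(file_b))
    keys_a = {key(row) for row in rows_a}
    keys_b = {key(row) for row in rows_b}
    only_a = (row for row in rows_a if key(row) not in keys_b)
    only_b = (row for row in rows_b if key(row) not in keys_a)
    return list(chain(only_a, only_b))


# Made-up sample data standing in for the two CSV files.
old = io.StringIO("1,a,b,x\n2,c,d,y\n")
new = io.StringIO("1,a,b,x\n3,e,f,z\n")
result = symmetric_difference_rows(old, new)
print(result)
```

The trade-off is memory: the seek(0) version keeps only the key sets in memory, while this one holds every row of both files.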

edited Jul 19, 2015 at 18:47

answered Jul 19, 2015 at 18:19

Padraic Cunningham

7 Comments

Michal K Over a year ago

Thanks, I'm after the rows that don't have the same element in row[0] and row[3]. I will still try the second approach to figure out the difference.

Padraic Cunningham Over a year ago

The second approach should give you the symmetric difference based on the first and fourth columns.

Michal K Over a year ago

The second approach gives me a "list index out of range" error.

Padraic Cunningham Over a year ago

Then you don't have at least four values in each row. Add a link to the data if possible.

Michal K Over a year ago

I had a good look at all the files: the first two have 4 columns and the third one is an empty file. Could that be the issue?

File "C:/testcsv/Pandaman.py", line 28, in <genexpr>
    st = set((row[0], row[3]) for row in r1)
IndexError: list index out of range

I've also tried row[2], but with the same results.
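An IndexError on row[3] means at least one row has fewer than four fields. csv.reader yields an empty list for a blank line, so blank lines in the input are enough to trigger it. A minimal sketch for locating such rows follows; the function name `short_rows`, its `min_fields` parameter, and the sample data are invented for illustration.

```python
import csv
import io


def short_rows(lines, min_fields=4):
    """Return (line_number, row) pairs for rows with fewer than min_fields."""
    reader = csv.reader(lines)
    # reader.line_num is the number of source lines read so far, so it
    # points at the line that produced the current row.
    return [(reader.line_num, row) for row in reader if len(row) < min_fields]


# Made-up sample: line 2 is blank and line 3 has only three fields.
sample = io.StringIO("1,a,b,x\n\n2,c,d\n3,e,f,z\n")
problems = short_rows(sample)
print(problems)
```

If dropping such rows is acceptable, checking `len(row) > 3` before indexing is one way to avoid the error.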
