I'm trying to compare two CSV files (fileA and fileB) and remove any rows from fileA that are not found in fileB. I want to do this without creating a third file. I thought I could do this using the csv module's writer, but now I'm second-guessing myself.

Currently, I'm using the following code to record my comparison data from file B:

import csv

# Keys (first and third column) of every data row in fileB
removal_list = set()
with open('fileB', 'rb') as file_b:
    reader1 = csv.reader(file_b)
    next(reader1)  # skip header
    for row in reader1:
        removal_list.add((row[0], row[2]))

This is where I'm stuck and do not know how to delete the rows:

with open('fileA', 'ab') as file_a:
    with open('fileB', 'rb') as file_b:
        writer = csv.writer(file_a)
        reader2 = csv.reader(file_b)
        next(reader2)
        for row in reader2:
            if (row[0], row[2]) not in removal_list:
                # If row was not present in file B, Delete it from file A.
                # stuck here:  writer.<HowDoIRemoveRow>(row)
  • sqlite is a flat-file based database and the drivers for it are included in modern versions of Python. It might be a better option considering what you are trying to do. Commented Apr 29, 2013 at 5:14
  • Sorry for the silly question, but won't this create an exact copy of fileB? Commented Jul 5, 2016 at 14:40

3 Answers

This solution uses fileinput with inplace=True, which moves the original file to a backup, redirects standard output into a new file under the original name, and removes the backup when the input is closed. You can't remove rows from a file, but you can rewrite it with only the ones you want.

From the fileinput documentation: "if the keyword argument inplace=1 is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently). This makes it possible to write a filter that rewrites its input file in place."

fileA

h1,h2,h3
a,b,c
d,e,f
g,h,i
j,k,l

fileB

h1,h2,h3
a,b,c
1,2,3
g,h,i
4,5,6

import fileinput, sys, csv

with open('fileB', 'rb') as file_b:
    r = csv.reader(file_b)
    next(r)  # skip header
    seen = {(row[0], row[2]) for row in r}

f = fileinput.input('fileA', inplace=True)  # sys.stdout is redirected to the file
print next(f),  # write header as first line

# lineterminator='\n' keeps the data rows' line endings consistent with the header
w = csv.writer(sys.stdout, lineterminator='\n')
for row in csv.reader(f):
    if (row[0], row[2]) in seen:  # write it if it's in B
        w.writerow(row)
f.close()  # restore sys.stdout and remove the backup file

fileA (after running the script)

h1,h2,h3
a,b,c
g,h,i
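
The script above is Python 2 (print statement, 'rb' mode). A rough Python 3 sketch of the same fileinput technique might look like the following. The function name keep_rows_found_in_b is made up for illustration, and the sketch assumes both files have a header row and no quoted fields containing line breaks.

```python
import csv
import fileinput
import sys


def keep_rows_found_in_b(path_a, path_b):
    """Rewrite path_a in place, keeping rows whose (col 0, col 2) key occurs in path_b."""
    with open(path_b, newline='') as file_b:
        reader = csv.reader(file_b)
        next(reader)  # skip header (assumed to exist)
        seen = {(row[0], row[2]) for row in reader}

    with fileinput.input(path_a, inplace=True) as f:
        reader = csv.reader(f)
        # fileinput only redirects sys.stdout once the first line has been read,
        # so read the header BEFORE handing sys.stdout to csv.writer.
        header = next(reader)
        writer = csv.writer(sys.stdout, lineterminator='\n')
        writer.writerow(header)
        for row in reader:
            if (row[0], row[2]) in seen:  # keep it only if it is also in B
                writer.writerow(row)
```

The ordering matters: creating the writer before the first read would bind it to the real standard output, and the rows would be printed to the console instead of written to the file.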

6 Comments

A subtle improvement not addressed in the explanation: this code uses a set, a far more optimal data structure for answering "is this data present?" than a list (which must be iterated over each time).
@David Op also used a set though
D'oh. S/he clearly did. Well, small bit of advice: don't call it a removal "list", or bone-headed people like me will get confused as to the variable's type. =)
What version of Python? I don't believe this syntax is 2.4 compatible
@justin You tagged it as 2.7? You can just use set((row[0], row[2]) for row in r) instead

CSV is not a database format. It is read and written as a whole. You can't remove rows in the middle. So the only way to do this without creating a third file is to read the file completely into memory and then write it back out without the offending rows.

But in general it's better to use a third file.
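
A minimal sketch of the third-file approach, assuming Python 3 and the same (row[0], row[2]) key used in the question. The function name filter_via_temp_file is hypothetical; the filtered rows go to a temporary file in the same directory, which then replaces fileA in a single rename.

```python
import csv
import os
import tempfile


def filter_via_temp_file(path_a, path_b):
    """Keep only the rows of path_a whose (col 0, col 2) key occurs in path_b."""
    with open(path_b, newline='') as file_b:
        reader = csv.reader(file_b)
        next(reader)  # skip header (assumed to exist)
        keys_in_b = {(row[0], row[2]) for row in reader}

    # Create the temporary file next to path_a so the final rename
    # stays on the same filesystem.
    directory = os.path.dirname(os.path.abspath(path_a))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', newline='') as tmp, \
                open(path_a, newline='') as file_a:
            reader = csv.reader(file_a)
            writer = csv.writer(tmp, lineterminator='\n')
            writer.writerow(next(reader))  # copy the header through
            writer.writerows(
                row for row in reader if (row[0], row[2]) in keys_in_b
            )
        os.replace(tmp_path, path_a)  # swap the filtered copy into place
    except BaseException:
        os.unlink(tmp_path)  # do not leave a stray temp file behind
        raise
```

Only one row is held in memory at a time, and fileA is left untouched if anything fails before the rename. Note that the replacement file gets the permissions mkstemp assigns (owner read/write only), which may differ from the original's.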


As Lennart described, you can't modify a CSV file in-place as you iterate over it.

If you're really opposed to creating a third file, you might want to look into using a string buffer with StringIO, the idea being that you build up the new desired contents of file A in memory. At the end of your script, you can write the contents of the buffer over file A.

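import csv
from cStringIO import StringIO

# removal_list is the set of (row[0], row[2]) keys built from fileB in the question
new_a_buf = StringIO()
writer = csv.writer(new_a_buf)
with open('fileA', 'rb') as file_a:
    reader2 = csv.reader(file_a)
    writer.writerow(next(reader2))  # copy the header through
    for row in reader2:
        if (row[0], row[2]) in removal_list:  # keep rows that also appear in fileB
            writer.writerow(row)

# At this point, the new contents of fileA (new_a_buf) exist only in memory
with open('fileA', 'wb') as file_a:
    file_a.write(new_a_buf.getvalue())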
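
cStringIO only exists in Python 2. On Python 3 the same buffer idea could be sketched with io.StringIO, roughly as below. The function name filter_in_memory is made up for illustration, and both files are assumed to have a header row.

```python
import csv
import io


def filter_in_memory(path_a, path_b):
    """Build the filtered contents of path_a in memory, then overwrite path_a."""
    with open(path_b, newline='') as file_b:
        reader = csv.reader(file_b)
        next(reader)  # skip header (assumed to exist)
        keys_in_b = {(row[0], row[2]) for row in reader}

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    with open(path_a, newline='') as file_a:
        reader = csv.reader(file_a)
        writer.writerow(next(reader))  # copy the header through
        for row in reader:
            if (row[0], row[2]) in keys_in_b:
                writer.writerow(row)

    # path_a is only truncated after the whole result has been built
    with open(path_a, 'w', newline='') as file_a:
        file_a.write(buf.getvalue())
```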

3 Comments

A word of caution here: you may exhaust the available memory for your system if your input files are large.
You may as well just write to a different file and rename it at the end, that is what my solution does
@jamylak, I completely agree with you. And that's exactly what I would do in this situation. I just figured this would be useful in that it technically meets what the asker is looking for.
