
I have a backup hard drive that I know has duplicate files scattered around, and I decided it would be a fun project to write a little Python script to find and remove them. I wrote the following code just to traverse the drive, calculate the MD5 sum of each file, and compare it to what I am going to call my "first encounter" list. If the MD5 sum does not yet exist in the list, add it. If the sum does already exist, delete the current file.

import sys
import os
import hashlib

def checkFile(fileHashMap, file):
    fReader = open(file, 'rb')
    fileData = fReader.read()
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

    if fileHash in fileHashMap:
        ### Duplicate file.
        fileHashMap[fileHash].append(file)
        return True
    else:
        fileHashMap[fileHash] = [file]
        return False


def main(argv):
    fileHashMap = {}
    fileCount = 0
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            fileCount += 1
            print("------------: " + str(fileCount))
            filePath = os.path.join(curDir, file)
            print(filePath)
            checkFile(fileHashMap, filePath)

if __name__ == "__main__":
    main(sys.argv)

The script processes about 10 GB worth of files and then throws a MemoryError on the line 'fileData = fReader.read()'. I thought that since I am closing fReader and marking fileData for deletion after I have calculated the MD5 sum, I wouldn't run into this. How can I calculate the MD5 sums without running into this memory error?

Edit: I was asked to remove the dictionary and watch the memory usage to see if there may be a leak in hashlib. Here is the code I ran.

import sys
import os
import hashlib

def checkFile(file):
    fReader = open(file, 'rb')
    fileData = fReader.read()
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

def main(argv):
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            print("------: " + str(curDir + file))
            checkFile(curDir + file)

if __name__ == "__main__":
    main(sys.argv)

and I still get the memory crash.

  • how many files are we talking about? Commented Sep 7, 2015 at 16:19
  • I only get through about 200 files, but there are many more. It just happens that my first 20 or so files are somewhat large. Commented Sep 7, 2015 at 16:21
  • It crashes on a relatively small file if that means anything. Commented Sep 7, 2015 at 16:21
  • I just ran this and had no memory growth. Looking at the code I can't see a problem - looks good to me. You don't even need to do the del: that should be automatic when fileData goes out of scope. What version of Python are you running? There was a memory leak in hashlib but it was quite a long time ago... Commented Sep 7, 2015 at 16:34
  • They don't. I'm storing lists in the dictionary holding the file paths of all the duplicate files. I just did this so I can see all the duplicates. I haven't worked in the removal yet since I want to make sure it works properly before I start deleting things. Commented Sep 7, 2015 at 16:34

3 Answers


Your problem is that you read each file in its entirety: the files are too big for your system to load fully into memory, so the read throws the error.

As you can see in the Official Python Documentation, the MemoryError is:

Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C’s malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.

For your purpose, you can build the hash incrementally with hashlib.md5() and its update() method.

Read the file sequentially in chunks of, say, 4096 bytes and feed each chunk to the MD5 object:

def md5(fname):
    md5Hash = hashlib.md5()
    # Open in binary mode and feed the hash 4096-byte chunks so the
    # whole file is never held in memory at once.
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            md5Hash.update(chunk)
    return md5Hash.hexdigest()
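
Plugged into the question's code, checkFile can then call this helper and never hold a whole file in memory. A minimal sketch, reusing the md5 helper above and the asker's fileHashMap dictionary:

def checkFile(fileHashMap, file):
    # Hash the file in 4096-byte chunks via the md5 helper above,
    # instead of reading the whole file at once.
    fileHash = md5(file)

    if fileHash in fileHashMap:
        # Duplicate: remember this path alongside the first occurrence.
        fileHashMap[fileHash].append(file)
        return True
    else:
        fileHashMap[fileHash] = [file]
        return False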

4 Comments

I can give this a try, but my system loads a large disk image file I have early on just fine. That file is 5 GB, and the crash happens when loading a small file. Would that still be related to my system loading the entire file at once, given that the 5 GB file is my largest?
The process has a limited heap; it may not fill up while loading the small file by itself, but it is already nearly full because the earlier files were loaded too. Either way, the error you describe means the process ran out of memory. I'll edit the answer to include the documentation for that error.
I have been running your modification for about 20 mins now and it got past the point I kept crashing on, so I think this may be the solution. I am going to run it a little longer just to be sure.
Okay, thank you. In any case, if you reach a new conclusion that could improve the answer for other users with the same problem, please let me know and I'll add it to the answer.

Not a solution to your memory problem, but an optimization that might avoid it:

  • small files: calculate md5 sum, remove duplicates

  • big files: remember size and path

  • at the end, only calculate MD5 sums for files of the same size when there is more than one file of that size

Python's collections.defaultdict might be useful for this.
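
A minimal sketch of that idea, simplified so that every file is grouped by size first and only sizes that occur more than once get hashed. The group_by_size and find_duplicates names are made up here, and the chunked md5 helper mirrors the one in the accepted answer:

import os
import hashlib
from collections import defaultdict

def md5(fname):
    # Chunked MD5, as in the accepted answer, so big files are never
    # read into memory in one piece.
    md5Hash = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            md5Hash.update(chunk)
    return md5Hash.hexdigest()

def group_by_size(root):
    # Map file size -> list of paths with that size.
    sizeMap = defaultdict(list)
    for curDir, subDirs, files in os.walk(root):
        for name in files:
            path = os.path.join(curDir, name)
            sizeMap[os.path.getsize(path)].append(path)
    return sizeMap

def find_duplicates(root):
    # Only files sharing a size can possibly be duplicates, so most
    # files never need to be hashed at all.
    hashMap = defaultdict(list)
    for size, paths in group_by_size(root).items():
        if len(paths) < 2:
            continue
        for path in paths:
            hashMap[md5(path)].append(path)
    return {h: p for h, p in hashMap.items() if len(p) > 1}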

1 Comment

Yeah, I was thinking something along those lines might be an option, but I worry that if there is a memory leak, the numerous small files may still cause the issue at some point. I have a lot of GBs to process.

How about calling the openssl command from Python? It works on both Windows and Linux:

$ openssl md5 "file"
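
A rough sketch of wiring that up from Python with subprocess is below. It assumes an openssl binary is on the PATH and that the output looks like "MD5(file)= <hex>" (the exact format can vary between OpenSSL versions); openssl_md5 is just a made-up helper name:

import subprocess

def openssl_md5(path):
    # Run: openssl md5 <path>
    # Typical output: MD5(path)= d41d8cd98f00b204e9800998ecf8427e
    out = subprocess.check_output(["openssl", "md5", path])
    # Keep only the trailing hex digest, whatever the prefix looks like.
    return out.decode().strip().rsplit(None, 1)[-1]

This keeps the file out of the Python process entirely, but it spawns one external process per file, which can easily be slower than hashlib when there are many small files.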
