I have a backup hard drive that I know has duplicate files scattered around, and I decided it would be a fun project to write a little Python script to find and remove them. I wrote the following code just to traverse the drive, calculate the MD5 sum of each file, and compare it to what I am going to call my "first encounter" list. If the MD5 sum does not yet exist in the list, add it. If the sum does already exist, delete the current file.
import sys
import os
import hashlib

def checkFile(fileHashMap, file):
    fReader = open(file, 'rb')  # binary mode so MD5 gets raw bytes
    fileData = fReader.read()
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData
    if fileHash in fileHashMap:
        # Duplicate file.
        fileHashMap[fileHash].append(file)
        return True
    else:
        fileHashMap[fileHash] = [file]
        return False

def main(argv):
    fileHashMap = {}
    fileCount = 0
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            fileCount += 1
            print("------------: " + str(fileCount))
            filePath = os.path.join(curDir, file)  # curDir + file drops the path separator
            print(filePath)
            checkFile(fileHashMap, filePath)

if __name__ == "__main__":
    main(sys.argv)
The script processes about 10 GB worth of files and then throws MemoryError on the line 'fileData = fReader.read()'. I thought that since I close the fReader and del the fileData after calculating the MD5 sum, I wouldn't run into this. How can I calculate the MD5 sums without running into this memory error?
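One approach that avoids holding an entire file in memory is to feed it to the hash in fixed-size chunks via hashlib.md5().update(). A minimal sketch (hashFile and the 64 KiB chunk size are illustrative choices, not part of the original script):

import hashlib

def hashFile(path, chunkSize=65536):
    # Hash the file incrementally so only one chunk is in memory
    # at a time, regardless of the file's size.
    hasher = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunkSize)
            if not chunk:
                break
            hasher.update(chunk)
    return hasher.hexdigest()

The resulting hexdigest is identical to hashing the whole file in one call, so it could replace the read-everything step in checkFile directly.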
Edit: I was asked to remove the dictionary and watch the memory usage to see if there might be a leak in hashlib. Here is the code I ran.
import sys
import os
import hashlib

def checkFile(file):
    fReader = open(file, 'rb')  # binary mode so MD5 gets raw bytes
    fileData = fReader.read()
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

def main(argv):
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            filePath = os.path.join(curDir, file)  # curDir + file drops the path separator
            print("------: " + filePath)
            checkFile(filePath)

if __name__ == "__main__":
    main(sys.argv)
and I still get the MemoryError.
del: that should happen automatically when fileData goes out of scope. What version of Python are you running? There was a memory leak in hashlib, but that was quite a long time ago...
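If a leak is suspected, one way to measure it (a sketch, assuming Python 3.4+ and the checkFile from the edit above; filePaths is a hypothetical list of files to hash) is the standard-library tracemalloc module, which reports memory allocated through Python's allocator:

import tracemalloc

tracemalloc.start()
for path in filePaths:  # hypothetical list of file paths
    checkFile(path)
    # current/peak bytes allocated via Python's allocator so far
    current, peak = tracemalloc.get_traced_memory()
    print("after %s: current=%d bytes, peak=%d bytes" % (path, current, peak))
tracemalloc.stop()

If current keeps climbing from file to file after checkFile returns, something is retaining the data; if it stays flat, the MemoryError is more likely a single oversized read() than a leak.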