I am trying to write a Python script to find duplicate files on a USB flash drive.

The process I am following is to create a list of the file names, hash each file, and then build an inverse dictionary. However, somewhere in the process I get a UnicodeDecodeError. Could someone help me understand what's going on?

from os import listdir
from os.path import isfile, join
from collections import defaultdict
import hashlib

my_path = r"F:/"

files_in_dir = [ file for file in listdir(my_path) if isfile(join(my_path, file)) ]
file_hashes = dict()

for file in files_in_dir:
    file_hashes[file] = hashlib.md5(open(join(my_path, file), 'r').read()).digest()

inverse_dict = defaultdict(list)

for file, file_hash in file_hashes.iteritems():
    inverse_dict[file_hash].append(file)

inverse_dict.items()

The error that I face is:

Traceback (most recent call last):
  File "C:\Users\Fotis\Desktop\check_dup.py", line 12, in <module>
    file_hashes[file] = hashlib.md5(open(join(my_path, file), 'r').read()).digest()
  File "C:\Python33\lib\encodings\cp1253.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 2227: character maps to <undefined>
  • @Martijn Pieters It's Python 3. I will retag the question appropriately. Commented Dec 3, 2012 at 18:12

1 Answer

You are trying to read a file whose contents are not valid in the default platform encoding (cp1253). When you open a file in text mode (r), Python 3 decodes its contents to Unicode. You didn't specify an encoding, so the platform's preferred encoding is used, and byte 0xff has no mapping in it.
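
To see the failure in isolation, here is a minimal sketch that reproduces it. The file name sample.bin and its contents are made up for illustration, and encoding="cp1253" is passed explicitly only to mimic the platform default from the traceback on any machine.

```python
import os
import tempfile

# Hypothetical sample data: byte 0xff has no mapping in cp1253,
# like the byte reported in the traceback.
data = b"abc\xffdef"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as f:
        f.write(data)

    # Text mode decodes the bytes; the explicit encoding stands in for
    # the platform default that open() would pick on the asker's system.
    try:
        with open(path, "r", encoding="cp1253") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode skips decoding and returns the raw bytes unchanged.
    with open(path, "rb") as f:
        raw = f.read()
```

After this runs, text_mode_error holds the UnicodeDecodeError raised by the text-mode read, while raw is identical to the bytes that were written.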

Open the files in binary mode instead, using rb as the mode. Since you are only calculating the MD5 hash, and hashlib.md5 expects bytes, you should not be using text mode anyway.
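
Applied to the script in the question, the fix could look roughly like the sketch below. The helper names md5_of_file and find_duplicates and the chunk_size parameter are my own, not part of the original script. Reading in chunks is an optional extra so that large files are not loaded into memory at once.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in binary mode in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:  # 'rb': raw bytes, no decoding step
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory):
    """Group file names by digest and keep only groups with 2+ files."""
    by_hash = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path):
            by_hash[md5_of_file(full_path)].append(name)
    # dict.items() is used here; iteritems() does not exist in Python 3.
    return {h: names for h, names in by_hash.items() if len(names) > 1}
```

With the path from the question this would be called as find_duplicates(r"F:/"). The sketch groups names by hash directly instead of building a name-to-hash dictionary and inverting it; both approaches give the same groups.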
