I'm running Python 3.5.1 on Windows. I am attempting to find duplicate source code files in a directory by computing their hash. The problem is that Python seems to think some files are empty. Here is the relevant code snippet:

with open(path, 'rb') as afile:
    hasher = hashlib.md5()
    data = afile.read()
    hasher.update(data)
    print("len(data): {}, Path: {}, Hash:{}".format(len(data), path, hasher.hexdigest()))

Here is some example output:

len(data): 0, Path: h:\t\TCPServerSocket.h, Hash:d41d8cd98f00b204e9800998ecf8427e
len(data): 0, Path: h:\t\TCPSocket.cpp, Hash:d41d8cd98f00b204e9800998ecf8427e
len(data): 0, Path: h:\t\TCPSocket.h, Hash:d41d8cd98f00b204e9800998ecf8427e
len(data): 5073, Path: h:\t\ConfigFile.cpp, Hash:6188d6a0e0bc02edf27ce232689beff6

I assure you that these files are not empty, and Python is not throwing any errors during execution. Any ideas?

  • The path has the wrong slashes (Windows!), so stuff might get escaped. Do you use the os.path functions? Commented May 26, 2016 at 21:30
  • Hi, yes, I am using the os.path functions. Python is accessing the files fine, it just thinks that they are empty. I can open the files in an editor without issue as well. Commented May 26, 2016 at 21:33
  • Are you sure that is the code you're actually running? Your print statement has data: , but the output is len(data): . Commented May 26, 2016 at 21:46
  • Hi John. Yes, it runs. Copying the code and output to SO got out of sync. I have updated the post to be accurate. Commented May 27, 2016 at 13:19
  • I notice that is the MD5 of an empty string, so look at what happens before that point. Maybe check the existence and size of the file at "path" before you open it? Commented May 27, 2016 at 13:34

2 Answers

I'll just delete this answer if it is not the case, but it's something you need to check. Put this directly before the open block:

print("the path is {!r}".format(path))
print("path exists: ", os.path.exists(path))
print("it is a file: ", os.path.isfile(path))
print("file size is: ", os.path.getsize(path))

Everything in your output is consistent with those files actually being empty, so maybe they are? My first thought was that you might be zeroing out the files elsewhere, although you would figure that out pretty quickly.
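If it helps, the same checks can be folded into the hashing step so that a zero-length file stands out immediately. This is only a sketch: `describe_and_hash` and the 64 KiB chunk size are names and values I made up for illustration, not anything from the question.

```python
import hashlib
import os


def describe_and_hash(path):
    """Return (size_on_disk, md5_hex) for the file at path.

    Hypothetical helper: it checks that path is a regular file, records
    the size the OS reports, then hashes the contents in binary mode.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError("not a regular file: {!r}".format(path))
    size = os.path.getsize(path)
    hasher = hashlib.md5()
    with open(path, 'rb') as afile:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: afile.read(65536), b''):
            hasher.update(chunk)
    return size, hasher.hexdigest()


# MD5 of zero bytes; any file that hashes to this value is empty on disk.
EMPTY_MD5 = hashlib.md5(b'').hexdigest()
```

A size of 0 together with a digest equal to `EMPTY_MD5` means the file really is empty at the moment the script reads it, whatever an editor happens to display.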

2 Comments

file size is: 0 for these files. I went back, erased the source, re-checked out the files, and re-loaded them in my editor. The files appeared normal. I ran some other scripts in the tool chain and, voilà, the files were zeroed out. Yet the editor retains the original contents even upon reopening. I truly do not understand this behavior, but thank you for helping me to diagnose it.
The editor probably has auto-restore from a temp file. Look for code elsewhere that opens those files with the wrong flags (write instead of read). Easy mistake to make.
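To illustrate the wrong-flags point, here is a small sketch of how merely opening a file in write mode empties it, even if nothing is ever written. The helper `size_after_open` and the scratch file are hypothetical and exist only for this demonstration.

```python
import os
import tempfile


def size_after_open(path, mode):
    """Open path with the given mode, close it untouched, return its size."""
    with open(path, mode):
        pass  # no read or write happens here
    return os.path.getsize(path)


# Scratch file standing in for one of the source files (hypothetical).
fd, demo_path = tempfile.mkstemp(suffix='.h')
with os.fdopen(fd, 'wb') as f:
    f.write(b'#pragma once\n')

print(size_after_open(demo_path, 'rb'))  # read mode leaves the content alone
print(size_after_open(demo_path, 'wb'))  # write mode truncates the file to 0 bytes
os.remove(demo_path)
```

So a script elsewhere in the tool chain that opens the sources with 'w' or 'wb' would be enough to produce the zero-length files seen here.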

I think you should compute the hash by calling hashlib.md5 on the files themselves:

import hashlib
hashlib.md5("filename").hexdigest()

Let me know if that continues to suggest the files are empty.

1 Comment

I think your code only hashes the file name, but I want to hash the file contents (sorry if that was unclear).
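For anyone reading later, the difference this comment points out can be sketched as follows. The function names are made up for illustration. Note also that on Python 3, passing a plain str to hashlib.md5 raises a TypeError, so the string has to be encoded to bytes first.

```python
import hashlib


def md5_of_name(path):
    """Hash the characters of the path string itself (not the file)."""
    return hashlib.md5(path.encode('utf-8')).hexdigest()


def md5_of_contents(path):
    """Hash the bytes stored in the file at path."""
    with open(path, 'rb') as afile:
        return hashlib.md5(afile.read()).hexdigest()
```

Two files with identical contents but different names produce the same `md5_of_contents` and different `md5_of_name`, which is why only the former is useful for finding duplicates.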
