1

This was a question that appeared in a Python coding competition and was wondering how this can be achieved.

Problem statement:

You have two directories(with possible subdirectories in it). Your script should find out duplicate files by comparing contents of the same filenames in the two root directories

Result: FAIL: If contents of atleast one same filename differs

PASS: Otherwise

Here's a sample figure

 /dir1                       /dir2
       -- file1                   -- file1 
       -- file2                   -- fileA  
       -- file3                   -- fileB   
       -- ....
       -- ...
       ---/subDir1
            --file1
            --file2

file1 of dir1 contains :- foo bar
file1  of dir2 contains :- foo 
Result - Fail

file1  of dir1 contains :- foo bar
file1  of dir2 contains :- foo bar
Result - Pass.

I attempted using hashing by file size, but it was obviously not the way :)

PS: Any scripting language can be used.

Thanks Kelly

1
  • 1
    I was going to explain a method but I think it's better to just refer you to the complete program ssokolow.com/scripts/fastdupes.py (I didn't write it but I use a modified version of it), looking at it would be better than me attempting to explain how it works. Commented Mar 1, 2012 at 5:09

2 Answers 2

3

You can solve this in a tiered manner.

  1. Go through each dir/subdir, compare the size of the files.
  2. If the file size is different => fail
  3. Compute the SHA1 hash of each file if that does not match => fail
  4. If SHA1 hashes match do a byte by byte comparision of the contents in the file to be absolutely sure.
Sign up to request clarification or add additional context in comments.

1 Comment

Can you please provide a pseudo code? I am following the post given here but not able to get the result code.activestate.com/recipes/…
1

Have a look at the filecmp module in the standard library.

Computing hashes is not useful when each file is compared to just one other file. The entire file must be read to compute a hash, and read again to confirm a match. By contrast, a direct comparison can be aborted at first difference.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.