All the examples here take into account only the timestamp, length, and name. That is certainly not enough.
Imagine this example.
You have two files:
c:\test_path\test.txt and c:\test_path\temp\test.txt.
The first one contains 12345; the second contains 54321. They share the same name and length (and possibly the same timestamp), so these files will be considered identical even though they are not.
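To see the failure concretely, here is a minimal sketch (the paths and file contents are the ones from the example above):

# Create the two example files
New-Item -ItemType Directory -Path 'C:\test_path\temp' -Force | Out-Null
Set-Content -Path 'C:\test_path\test.txt' -Value '12345' -NoNewline
Set-Content -Path 'C:\test_path\temp\test.txt' -Value '54321' -NoNewline

$a = Get-Item 'C:\test_path\test.txt'
$b = Get-Item 'C:\test_path\temp\test.txt'

# Name and length match, so a naive check reports a duplicate...
($a.Name -eq $b.Name) -and ($a.Length -eq $b.Length)                    # True

# ...but hashing the content tells the files apart
(Get-FileHash $a.FullName).Hash -eq (Get-FileHash $b.FullName).Hash    # False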
I have created a duplicate checker based on hash calculation. I wrote it off the top of my head, so it is rather crude (but I think you get the idea, and it will be easy to optimize):
Edit: I've decided the source code was "too crude" (a nickname for incorrect), so I have improved it (removed superfluous code):
# The current directory where the script is executed
$path = (Resolve-Path .\).Path
$duplicities = @{}

# Remove records that are unique by size (different size = different hash).
# You can select only the files you need with e.g. "*.jpg"
$file_names = Get-ChildItem -Path $path -Recurse -Include "*.*" |
    Where-Object { -not $_.PSIsContainer } |
    Group-Object Length |
    Where-Object { $_.Count -gt 1 } |
    Select-Object -ExpandProperty Group |
    Select-Object FullName, Length

# I'm using SHA256 due to the SHA1 collisions that have been found
$hash_details = ForEach ($file in $file_names) {
    Get-FileHash -Path $file.FullName -Algorithm SHA256
}

# Just a counter for the hash table key
$counter = 0
ForEach ($first_file_hash in $hash_details) {
    ForEach ($second_file_hash in $hash_details) {
        # Same hash but a different path means a duplicate; the check on
        # already-recorded paths keeps groups of three or more identical
        # files from being listed repeatedly
        If (($first_file_hash.Hash -eq $second_file_hash.Hash) -and
            ($first_file_hash.Path -ne $second_file_hash.Path) -and
            ($duplicities.Values.Path -notcontains $second_file_hash.Path)) {
            $duplicities.Add($counter, $second_file_hash)
            $counter += 1
        }
    }
}

## Report the duplicate files
If ($duplicities.Count -gt 0) {
    Write-Output "Duplicate files found:" $duplicities.Values.Path
    $duplicities.Values | Out-File -Encoding UTF8 duplicate_log.txt
} Else {
    Write-Output 'No duplicities found'
}
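Since the nested comparison is quadratic, one easy optimization (a sketch of an alternative, not part of the script above) is to let Group-Object do the pairing on the hash value itself:

# Alternative sketch: group the hash results by the Hash value and
# keep only the groups that contain more than one file
$duplicate_groups = $hash_details |
    Group-Object Hash |
    Where-Object { $_.Count -gt 1 }

ForEach ($group in $duplicate_groups) {
    Write-Output "Hash $($group.Name):"
    Write-Output $group.Group.Path
}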
I have created a test structure:
PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> Get-ChildItem -path $path -Recurse

    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities

Mode          LastWriteTime  Length Name
----          -------------  ------ ----
d----    9.4.2018      9:58         test
-a---    9.4.2018     11:06    2067 check_for_duplicities.ps1
-a---    9.4.2018     11:06     757 duplicate_log.txt

    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test

Mode          LastWriteTime  Length Name
----          -------------  ------ ----
d----    9.4.2018      9:58         identical_file
d----    9.4.2018      9:56         t
-a---    9.4.2018      9:55       5 test.txt

    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file

Mode          LastWriteTime  Length Name
----          -------------  ------ ----
-a---    9.4.2018      9:55       5 test.txt

    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\t

Mode          LastWriteTime  Length Name
----          -------------  ------ ----
-a---    9.4.2018      9:55       5 test.txt
(Where the file in ..\duplicities\test\t is different from the others.)
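If you want to reproduce it, the structure can be built with a few lines (a sketch; the paths mirror the listing above):

# Build the test tree: two identical files and one same-sized, different file
New-Item -ItemType Directory -Path test\identical_file, test\t -Force | Out-Null
Set-Content test\test.txt '12345' -NoNewline
Set-Content test\identical_file\test.txt '12345' -NoNewline    # identical content
Set-Content test\t\test.txt '54321' -NoNewline                 # same size, different content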
The result of running the script.
The console output:
PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> .\check_for_duplicities.ps1
Duplicate files found:
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt
The duplicate_log.txt file contains more detailed information:
Algorithm Hash Path
--------- ---- ----
SHA256 5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5 C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt
SHA256 5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5 C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt
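If you need the log in machine-readable form instead, the same values can be exported as CSV (a sketch using the standard Export-Csv cmdlet):

# Alternative logging: write the duplicate details to a CSV file
$duplicities.Values |
    Select-Object Algorithm, Hash, Path |
    Export-Csv -Path duplicate_log.csv -NoTypeInformation -Encoding UTF8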
Conclusion
As you can see, the differing file is correctly omitted from the result set.