
I have a task to check whether new files have been imported for the day into a shared folder and to alert if any of them are duplicates. No recursive check is needed.

The code below lists the details (including size) of all files that are at most one day old. However, I need only the files that share the same size, because I cannot compare them by name.

$Files = Get-ChildItem -Path E:\Script\test |
Where-Object {$_.CreationTime -gt (Get-Date).AddDays(-1)}

$Files | Select-Object -Property Name, hash, LastWriteTime, @{N='SizeInKb';E={[double]('{0:N2}' -f ($_.Length/1kb))}}
  • What if two files have the same size but different names? Will that be considered as a duplicate? Commented Apr 5, 2018 at 7:21
  • As two files in the same folder cannot have the same name, I am considering size to detect duplicates. Let me know if you have any other ideas. Commented Apr 5, 2018 at 7:45
  • @Teja554 does the answer solve your problem? Commented Apr 12, 2018 at 7:40
  • Hey tukan, your answer looks great. Unfortunately my PowerShell version is 2.0, so I am unable to use Get-FileHash. I am working on calling the .NET method directly instead. Commented Apr 12, 2018 at 12:12
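For reference, a rough sketch of what that last comment describes: computing a file hash on PowerShell 2.0 without Get-FileHash by calling .NET directly (the file path below is a placeholder):

# Placeholder path; point this at a real file
$filePath = 'E:\Script\test\example.txt'

# SHA256 hasher from the .NET Framework (available even on PowerShell 2.0)
$sha256 = [System.Security.Cryptography.SHA256]::Create()

# Stream the file through the hasher and always close the stream afterwards
$stream = [System.IO.File]::OpenRead($filePath)
try {
    $hashBytes = $sha256.ComputeHash($stream)
} finally {
    $stream.Close()
}

# Convert the byte array to the usual uppercase hex string
[System.BitConverter]::ToString($hashBytes) -replace '-', ''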

4 Answers


I didn't like the big DOS-like script answer written here, so here's an idiomatic way of doing it in PowerShell:

From the folder in which you want to find the duplicates, just run this simple pipeline:

Get-ChildItem -Recurse -File `
| Group-Object -Property Length `
| ?{ $_.Count -gt 1 } `
| %{ $_.Group } `
| Get-FileHash `
| Group-Object -Property Hash `
| ?{ $_.Count -gt 1 } `
| %{ $_.Group }

This will show all files, and their hashes, that match other files.
Each line does the following:

  • get files
    • from current directory (use -Path $directory otherwise)
    • recursively (if not wanted, remove -Recurse)
  • group based on file size
  • discard groups with less than 2 files
  • grab all those files
  • get hashes for each
  • group based on hash
  • discard groups with less than 2 files
  • get all those files

Add | %{ $_.path } to just show the paths instead of the hashes.
Add | %{ $_.path -replace "$([regex]::escape($(pwd)))",'' } to only show the relative path from the current directory (useful in recursion).
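For example, a sketch combining both of those additions onto the pipeline above (run from the directory you are scanning; nothing new is assumed beyond the commands already shown):

# Same size-then-hash pipeline, printing only paths relative to the current directory
Get-ChildItem -Recurse -File `
| Group-Object -Property Length | ?{ $_.Count -gt 1 } | %{ $_.Group } `
| Get-FileHash | Group-Object -Property Hash | ?{ $_.Count -gt 1 } | %{ $_.Group } `
| %{ $_.Path -replace "$([regex]::escape($(pwd)))",'' }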

For the question-asker specifically, don't forget to whack in | Where-Object {$_.CreationTime -gt (Get-Date).AddDays(-1)} after the gci so you're not comparing files you don't want to consider, which might get very time-consuming if you have a lot of coincidentally same-length files in that shared folder.
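Put together for that case, a sketch assuming PowerShell 4 or later (where Get-FileHash and the -File switch are available), using the path and one-day filter from the question and no -Recurse since no recursive check is wanted:

# Only the last day's files in the shared folder, no recursion; duplicates flagged by size, then hash
Get-ChildItem -Path E:\Script\test -File `
| Where-Object { $_.CreationTime -gt (Get-Date).AddDays(-1) } `
| Group-Object -Property Length | ?{ $_.Count -gt 1 } | %{ $_.Group } `
| Get-FileHash | Group-Object -Property Hash | ?{ $_.Count -gt 1 } | %{ $_.Group }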

Finally, if you're like me and just wanted to find dupes based on name, as google will probably take you here too:

gci -Recurse -file | Group-Object name | Where-Object { $_.Count -gt 1 } | select -ExpandProperty group | %{ $_.fullname }


4 Comments

This is one of the best one-liners I've seen in PowerShell, +1 man!
This is great, thanks! Could this be modified to take files from more than one path, to check two unrelated directories for duplicates, e.g. Get-ChildItem with two -Path directories?
@CharlesT It should be as simple as replacing Get-ChildItem ...| with .{ gci $dir1; gci $dir2 } |. I do have another answer that works over multiple directories to find differences instead (where you can infer what the duplicates are) and would be much faster IF the paths of the duplicates are also the same
@Hashbrown That works. Exactly what I was looking for. Thanks a lot!
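A minimal sketch of that two-directory variant from the comment above, where $dir1 and $dir2 are placeholders for the folders being compared:

# Emit the files of both directories into one stream, then reuse the size-then-hash grouping
. { Get-ChildItem $dir1 -Recurse -File; Get-ChildItem $dir2 -Recurse -File } `
| Group-Object -Property Length | ?{ $_.Count -gt 1 } | %{ $_.Group } `
| Get-FileHash | Group-Object -Property Hash | ?{ $_.Count -gt 1 } | %{ $_.Group }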

All the examples here take into account only timestamp, length and name. That is certainly not enough.

Imagine this example: you have two files, c:\test_path\test.txt and c:\test_path\temp\text.txt. The first one contains 12345; the second contains 54321. In this case these files will be considered identical even though they are not.

I have created a duplicate checker based on hash calculation. It was written off the top of my head just now, so it is rather crude (but I think you get the idea, and it will be easy to optimize):

Edit: I've decided the source code was "too crude" (a nickname for incorrect) and I have improved it (removed superfluous code):

# The current directory where the script is executed
$path = (Resolve-Path .\).Path

$hash_details = @{}
$duplicities = @{}    

# Remove unique record by size (different size = different hash)
# You can select only those you need with e.g. "*.jpg"
$file_names = Get-ChildItem -path $path -Recurse -Include "*.*" | ? {( ! $_.PSIsContainer)} | Group Length | ? {$_.Count -gt 1} | Select -Expand Group | Select FullName, Length 

# I'm using SHA256 due to SHA1 collisions found
$hash_details =  ForEach ($file in $file_names) {
                             Get-FileHash -Path $file.Fullname -Algorithm SHA256
                         }

# just counter for the Hash table key
$counter = 0
ForEach ($first_file_hash in $hash_details) {
    ForEach ($second_file_hash in $hash_details) {
        If (($first_file_hash.hash -eq $second_file_hash.hash) -and ($first_file_hash.path -ne $second_file_hash.path)) {
                $duplicities.add($counter, $second_file_hash)
                $counter += 1
        }
    }
}

# Output the duplicate files
If ($duplicities.count -gt 0) { 
    #Write-Output $duplicities.values
    Write-Output "Duplicate files found:" $duplicities.values.Path
    $duplicities.values | Out-file -Encoding UTF8 duplicate_log.txt
} Else {
    Write-Output 'No duplicities found'
}

I have created a test structure:

PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> Get-ChildItem -path $path -Recurse


    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
d----          9.4.2018      9:58            test
-a---          9.4.2018     11:06       2067 check_for_duplicities.ps1
-a---          9.4.2018     11:06        757 duplicate_log.txt


    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
d----          9.4.2018      9:58            identical_file
d----          9.4.2018      9:56            t
-a---          9.4.2018      9:55          5 test.txt


    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---          9.4.2018      9:55          5 test.txt


    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\t


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---          9.4.2018      9:55          5 test.txt

(The file in ..\duplicities\test\t is different from the others.)

The result of running the script:

The console output:

PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> .\check_for_duplicities.ps1
Duplicate files found:
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt

The duplicate_log.txt file contains more detailed information:

Algorithm       Hash                                                                   Path                                                                                              
---------       ----                                                                   ----                                                                                              
SHA256          5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5       C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt             
SHA256          5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5       C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt                            

Conclusion

As you can see, the different file is correctly omitted from the result set.
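As a side note, a sketch only (reusing the $hash_details variable from the script above): the nested comparison loop could also be replaced by grouping on the hash, in the spirit of the pipeline-based answer here:

# Keep only hash groups that contain more than one file
$dupes = $hash_details | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } | Select-Object -ExpandProperty Group

If ($dupes) {
    Write-Output "Duplicate files found:"
    $dupes | ForEach-Object { $_.Path }
} Else {
    Write-Output 'No duplicities found'
}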

Comments


Since it is the file contents that you are determining to be duplicate, it's more prudent to just hash the files and compare the hashes.

Name, size and timestamp would not be prudent attributes for the defined use case, since the hash will tell you whether the files have the same content.

See these discussions

Need a way to check if two files are the same? Calculate a hash of the files. Here is one way to do it: https://blogs.msdn.microsoft.com/powershell/2006/04/25/duplicate-files
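A minimal sketch of that idea for two files, where $fileA and $fileB are placeholder paths:

# Hash both files and compare the resulting hash strings
$hashA = Get-FileHash -Path $fileA -Algorithm SHA256
$hashB = Get-FileHash -Path $fileB -Algorithm SHA256

if ($hashA.Hash -eq $hashB.Hash) {
    Write-Output 'Files have identical content'
} else {
    Write-Output 'Files differ'
}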

Duplicate File Finder and Remover

And now the moment you have been waiting for....an all PowerShell file duplicate finder and remover! Now you can clean up all those copies of pictures, music files, and videos. The script opens a file dialog box to select the target folder, recursively scans each file for duplicates…

https://gallery.technet.microsoft.com/scriptcenter/Duplicate-File-Finder-and-78f40ae9

1 Comment

Some code here would be helpful. If the external links die nothing will be left here.

This might be helpful for you.

# Group the last day's files by length, then open every file whose length is not unique
$files = Get-ChildItem 'E:\SC' |
    Where-Object { $_.CreationTime -gt (Get-Date).AddDays(-1) } |
    Group-Object -Property Length

foreach ($filegroup in $files)
{
    if ($filegroup.Count -ne 1)
    {
        foreach ($file in $filegroup.Group)
        {
            Invoke-Item $file.FullName
        }
    }
}

4 Comments

How does that help with duplicates? Dates and sizes are not valid information for checking for duplicates.
If you look at the question, they are looking for a script to find files which are the same length.
I read the question and its text. Imagine this example: you have two files (with identical timestamps) at c:\test_path\test.txt and c:\test_path\temp\text.txt, one with the contents 12345 and the second with 54321. Are they identical in your code or not?
Not to mention, you are not commenting the code at all!
