All the examples here take into account only the timestamp, length, and name. That is certainly not enough.
Imagine this example.
You have two files:
c:\test_path\test.txt and c:\test_path\temp\test.txt.
The first one contains 12345; the second contains 54321. They share the same name and length (and possibly the same timestamp), so these files will be considered identical even though they are not.
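To see the failure concretely, here is a minimal sketch (the paths and file contents are the ones from the example above):

# Create the two example files
New-Item -ItemType Directory -Path 'C:\test_path\temp' -Force | Out-Null
Set-Content -Path 'C:\test_path\test.txt' -Value '12345' -NoNewline
Set-Content -Path 'C:\test_path\temp\test.txt' -Value '54321' -NoNewline

$a = Get-Item 'C:\test_path\test.txt'
$b = Get-Item 'C:\test_path\temp\test.txt'

# Name and length match, so a naive check reports a duplicate...
($a.Name -eq $b.Name) -and ($a.Length -eq $b.Length)                    # True

# ...but hashing the content tells the files apart
(Get-FileHash $a.FullName).Hash -eq (Get-FileHash $b.FullName).Hash    # False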
I have created a duplicate checker based on hash calculation. I wrote it off the top of my head, so it is rather crude (but I think you get the idea, and it will be easy to optimize):
Edit: I've decided the source code was "too crude" (a nickname for incorrect), so I have improved it (removed superfluous code):
# The current directory where the script is executed
$path = (Resolve-Path .\).Path
$duplicities = @{}

# Remove records that are unique by size (different size = different hash).
# You can select only the files you need with e.g. "*.jpg"
$file_names = Get-ChildItem -Path $path -Recurse -Include "*.*" |
    Where-Object { -not $_.PSIsContainer } |
    Group-Object Length |
    Where-Object { $_.Count -gt 1 } |
    Select-Object -ExpandProperty Group |
    Select-Object FullName, Length

# I'm using SHA256 due to the SHA1 collisions that have been found
$hash_details = ForEach ($file in $file_names) {
    Get-FileHash -Path $file.FullName -Algorithm SHA256
}

# Just a counter for the hash table key
$counter = 0
ForEach ($first_file_hash in $hash_details) {
    ForEach ($second_file_hash in $hash_details) {
        # Same hash but a different path means a duplicate; the check on
        # already-recorded paths keeps groups of three or more identical
        # files from being listed repeatedly
        If (($first_file_hash.Hash -eq $second_file_hash.Hash) -and
            ($first_file_hash.Path -ne $second_file_hash.Path) -and
            ($duplicities.Values.Path -notcontains $second_file_hash.Path)) {
            $duplicities.Add($counter, $second_file_hash)
            $counter += 1
        }
    }
}

## Report the duplicate files
If ($duplicities.Count -gt 0) {
    Write-Output "Duplicate files found:" $duplicities.Values.Path
    $duplicities.Values | Out-File -Encoding UTF8 duplicate_log.txt
} Else {
    Write-Output 'No duplicities found'
}
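Since the nested comparison is quadratic, one easy optimization (a sketch of an alternative, not part of the script above) is to let Group-Object do the pairing on the hash value itself:

# Alternative sketch: group the hash results by the Hash value and
# keep only the groups that contain more than one file
$duplicate_groups = $hash_details |
    Group-Object Hash |
    Where-Object { $_.Count -gt 1 }

ForEach ($group in $duplicate_groups) {
    Write-Output "Hash $($group.Name):"
    Write-Output $group.Group.Path
}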
I have created a test structure:
PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> Get-ChildItem -path $path -Recurse

    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities

Mode          LastWriteTime  Length Name
----          -------------  ------ ----
d----    9.4.2018      9:58         test
-a---    9.4.2018     11:06    2067 check_for_duplicities.ps1
-a---    9.4.2018     11:06     757 duplicate_log.txt

    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test

Mode          LastWriteTime  Length Name
----          -------------  ------ ----
d----    9.4.2018      9:58         identical_file
d----    9.4.2018      9:56         t
-a---    9.4.2018      9:55       5 test.txt

    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file

Mode          LastWriteTime  Length Name
----          -------------  ------ ----
-a---    9.4.2018      9:55       5 test.txt

    Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\t

Mode          LastWriteTime  Length Name
----          -------------  ------ ----
-a---    9.4.2018      9:55       5 test.txt
(Where the file in ..\duplicities\test\t is different from the others.)
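If you want to reproduce it, the structure can be built with a few lines (a sketch; the paths mirror the listing above):

# Build the test tree: two identical files and one same-sized, different file
New-Item -ItemType Directory -Path test\identical_file, test\t -Force | Out-Null
Set-Content test\test.txt '12345' -NoNewline
Set-Content test\identical_file\test.txt '12345' -NoNewline    # identical content
Set-Content test\t\test.txt '54321' -NoNewline                 # same size, different content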
The result of running the script.
The console output:
PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> .\check_for_duplicities.ps1
Duplicate files found:
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt
The duplicate_log.txt file contains more detailed information:
Algorithm Hash Path
--------- ---- ----
SHA256 5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5 C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt
SHA256 5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5 C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt
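If you need the log in machine-readable form instead, the same values can be exported as CSV (a sketch using the standard Export-Csv cmdlet):

# Alternative logging: write the duplicate details to a CSV file
$duplicities.Values |
    Select-Object Algorithm, Hash, Path |
    Export-Csv -Path duplicate_log.csv -NoTypeInformation -Encoding UTF8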
Conclusion
As you can see, the differing file is correctly omitted from the result set.