
I have several thousand duplicate files (JAR files, for example) and I'd like to use PowerShell to

  1. Search through the file system recursively
  2. Find the duplicates (by name only, by a checksum method, or both)
  3. Delete all duplicates but one.

I'm new to PowerShell and am throwing this out there to the PS folks who might be able to help.

5 Answers


Try this:

ls *.txt -recurse | get-filehash | group -property hash | where { $_.count -gt 1 } | % { $_.group | select -skip 1 } | del

from: http://n3wjack.net/2015/04/06/find-and-delete-duplicate-files-with-just-powershell/
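For readability, here is the same pipeline spelled out with full cmdlet names (a sketch; the *.txt filter and the starting directory are just examples, so adjust them for your JAR files):

Get-ChildItem *.txt -Recurse |                            # walk the tree
    Get-FileHash |                                        # SHA-256 by default
    Group-Object -Property Hash |                         # bucket identical content together
    Where-Object { $_.Count -gt 1 } |                     # keep only buckets with duplicates
    ForEach-Object { $_.Group | Select-Object -Skip 1 } | # spare the first file in each bucket
    Remove-Item                                           # Path binds from the pipeline objects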




Even though the question is old, I needed to clean up duplicate files based on content. The idea is simple, but the algorithm is not entirely straightforward. Here is a function that accepts a Path parameter and deletes the duplicates under it.

Function Delete-Duplicates {
    param(
        [Parameter(
            Mandatory=$True,
            ValueFromPipeline=$True,
            ValueFromPipelineByPropertyName=$True
        )]
        [string[]]$PathDuplicates
    )

    # Paths of every file whose hash occurs more than once;
    # Select-Object -First (Count - 1) leaves the last file of each group alive
    $DuplicatePaths =
        Get-ChildItem $PathDuplicates |
        Get-FileHash |
        Group-Object -Property Hash |
        Where-Object -Property Count -gt 1 |
        ForEach-Object {
            $_.Group.Path |
            Select-Object -First ($_.Count - 1)
        }

    $TotalCount = (Get-ChildItem $PathDuplicates).Count
    Write-Warning ("You are going to delete {0} files out of {1} total. Please confirm the prompt" -f $DuplicatePaths.Count, $TotalCount)
    $DuplicatePaths | Remove-Item -Confirm
}

The script

a) Lists all ChildItems

b) Retrieves FileHash from them

c) Groups them by Hash Property (so all the same files are in the single group)

d) Filters out the already-unique files (count of group -eq 1)

e) Loops through each group and lists all but the last path - ensuring one file of each "Hash" always stays

f) Warns before proceeding, saying how many files there are in total and how many are going to be deleted.

Probably not the most performant option (it hashes every file; note that Get-FileHash uses SHA-256 by default, not SHA-1), but it ensures the file really is a duplicate. Works perfectly fine for me :)
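Usage is then a one-liner (the folder path is just an example). Note that the function as written only looks at the top level of the given path; adding -Recurse to both Get-ChildItem calls inside it would make it walk subfolders, as the question asks.

Delete-Duplicates -PathDuplicates 'C:\Temp\Jars'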



Keep a dictionary of files and delete a file when its name has already been encountered:

$dict = @{};
dir c:\admin -Recurse | foreach {
  $key = $_.Name #replace this with your checksum function
  $find = $dict[$key];
  if($find -ne $null) {
    #current file is a duplicate
    #Remove-Item -Path $_.FullName ?    
  }
  $dict[$key] = 0; #dummy placeholder to save memory
}

I used file name as a key, but you can use a checksum if you want (or both) - see code comment.
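For completeness, here is a completed variant (a sketch, same example path) that uses Get-FileHash as the checksum and actually performs the removal:

$dict = @{}
dir c:\admin -Recurse -File | foreach {
  $key = (Get-FileHash $_.FullName).Hash  # content-based key instead of the file name
  if ($dict.ContainsKey($key)) {
    Remove-Item -Path $_.FullName         # same content seen before: delete this copy
  } else {
    $dict[$key] = 0                       # first sighting: keep the file
  }
}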

14 Comments

An array and a -contains check would suffice. No need for a dictionary.
@AnsgarWiechers: Performance issues on many files? I mean it would need to iterate over the array one by one every time and also recreate the array with every step, right?
I would place all the values in the curly braces of $dict = @{} separated by commas? Example: @{one.jar, two.jar}
@notec: not sure why you would want that, since your list would be dynamic. And no, you need to specify values for each - hash table is a key,value pair. See this link.
@Neolisk - I ran your script. I see that it finds all dups and assigns each the value of zero. How do I now remove the dups not assigned a value of zero? I see you have a Remove-Item line commented out with the pipeline object FullName questioned. Thanks for getting me this far.
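Regarding the array-vs-dictionary performance point above: a HashSet gives constant-time membership checks without the dictionary's dummy values (a sketch, same example path):

$seen = New-Object 'System.Collections.Generic.HashSet[string]'
dir c:\admin -Recurse -File | foreach {
  if (-not $seen.Add($_.Name)) {  # Add returns $false when the name was already seen
    # current file is a duplicate
    # Remove-Item -Path $_.FullName ?
  }
}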

Evolution of @KaiWang's answer which:

  1. Avoids calculating the hash of every single file by comparing file lengths first;
  2. Allows choosing which file you want to keep (here it keeps the file with the longest name).

Get-ChildItem *.ttf -Recurse |
  Group -Property Length |        # cheap pre-filter: only same-size files can match
  Where { $_.Count -gt 1 } |
  ForEach { $_.Group } |
  Get-FileHash -Algorithm 'MD5' | # hash only the size collisions
  Group -Property Hash |
  Where { $_.Count -gt 1 } |
  ForEach {
    $_.Group |
      Sort -Property @{ Expression = { $_.Path.Length } } |
      Select -SkipLast 1          # everything except the longest path gets deleted
  } |
  ForEach { $_.Path } |
  ForEach {
    Write-Host $_
    Del -LiteralPath $_
  }



Instead of just removing your duplicate files, you can replace each duplicate with a shortcut to the original:

#requires -version 3
<#
    .SYNOPSIS
    Duplicate cleanup script
    .DESCRIPTION
    Finds duplicates by size, compares their MD5 checksums, and groups them by size and MD5;
    can replace each duplicate with a link to the first file, the original

    .PARAMETER Path
    Path in which to search for duplicates

    .PARAMETER ReplaceByShortcut
    If specified, duplicates are replaced by shortcuts

    .PARAMETER MinLength
    Ignore files below this size (in bytes)

    .EXAMPLE
    .\Clean-Duplicate '\\dfs.adds\donnees\commun'

    .EXAMPLE
    Find duplicates of 10 KB and larger
    .\Clean-Duplicate '\\dfs.adds\donnees\commun' -MinLength 10000

    .EXAMPLE
    .\Clean-Duplicate '\\dpm1\d$\Coaxis\Logiciels' -ReplaceByShortcut
#>
[CmdletBinding()]
param (
    [string]$Path = '\\Contoso.adds\share$\path\data',
    [switch]$ReplaceByShortcut = $false,
    [int]$MinLength = 10*1024*1024 # 10 MB
)

$version = '1.0'

function Create-ShortCut ($ShortcutPath, $shortCutName, $Target) {
    $link = "$ShortcutPath\$shortCutName.lnk"
    $WshShell = New-Object -ComObject WScript.Shell
    $Shortcut = $WshShell.CreateShortcut($link)
    $Shortcut.TargetPath = $Target
    #$Shortcut.Arguments ="shell32.dll,Control_RunDLL hotplug.dll"
    #$Shortcut.IconLocation = "hotplug.dll,0"
    $Shortcut.Description = "Duplicate copy"
    #$Shortcut.WorkingDirectory ="C:\Windows\System32"
    $Shortcut.Save()
    # write-host -fore Cyan $link -nonewline; write-host -fore Red ' >> ' -nonewline; write-host -fore Yellow $Target 
    return $link
}

function Replace-ByShortcut {
    Param(
        [Parameter(ValueFromPipeline=$true,ValueFromPipelineByPropertyName=$true)]
            $SameItems
    )
    begin{
        $result = [pscustomobject][ordered]@{
            Replaced = @()
            Gain = 0
            Count = 0
        }
    }
    Process{
        $Original = $SameItems.group[0]
        foreach ($doublon in $SameItems.group) {
            if ($doublon -ne $Original) {
                $result.Replaced += [pscustomobject][ordered]@{
                    lnk = Create-Shortcut -ShortcutPath $doublon.DirectoryName -shortCutName $doublon.BaseName -Target $Original.FullName
                    target = $Original.FullName
                    size = $doublon.Length
                }
                $result.Gain += $doublon.Length
                $result.Count++
                Remove-item $doublon.FullName -force
            }
        }
    }
    End{
        $result
    }
}

function Get-MD5 {
    param (
        [Parameter(Mandatory)]
            [string]$Path
    )
    $HashAlgorithm = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
    $Stream = [System.IO.File]::OpenRead($Path)
    try {
        $HashByteArray = $HashAlgorithm.ComputeHash($Stream)
    } finally {
        $Stream.Dispose()
    }

    return [System.BitConverter]::ToString($HashByteArray).ToLowerInvariant() -replace '-',''
}

if (-not $Path) {
    if ((Get-Location).Provider.Name -ne 'FileSystem') {
        Write-Error 'Specify a file system path explicitly, or change the current location to a file system path.'
        return
    }
    $Path = (Get-Location).ProviderPath
}

$DuplicateFiles = Get-ChildItem -Path $Path -Recurse -File |
    Where-Object { $_.Length -ge $MinLength } |  # -ge so files exactly at MinLength are still considered
    Group-Object -Property Length |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object {
        # hash only the files that collide on size, then regroup by content
        $_.Group |
            ForEach-Object {
                $_ | Add-Member -MemberType NoteProperty -Name ContentHash -Value (Get-MD5 -Path $_.FullName)
            }
        $_.Group |
            Group-Object -Property ContentHash |
            Where-Object { $_.Count -gt 1 }
    }

$somme = ($DuplicateFiles.group | Measure-Object length -Sum).sum
write-host "$($DuplicateFiles.group.count) duplicates, totalling $($somme/1024/1024) MB" -fore cyan

if ($ReplaceByShortcut) {
    $DuplicateFiles | Replace-ByShortcut
} else {
    $DuplicateFiles
}
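A side note (my observation, not part of the original answer): on PowerShell 4 and later, the custom Get-MD5 helper above can be reduced to the built-in Get-FileHash cmdlet:

function Get-MD5 {
    param (
        [Parameter(Mandatory)]
            [string]$Path
    )
    # Get-FileHash returns uppercase hex; lowercase it to match the original helper's output
    return (Get-FileHash -Path $Path -Algorithm MD5).Hash.ToLowerInvariant()
}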

