
I am a PowerShell noob looking for a way to find duplicate files in a directory and write the file paths to a text or CSV file. My current code works, but it is extremely inefficient and slow. Any recommendations would be greatly appreciated.

#Declaring the Array to store file paths and names
$arr = (get-childitem "My Path" -recurse | where {$_.extension -like '*.*'})

#creating an array to hold already found duplicate elements in order to skip over them in the iteration
$arrDupNum = @(-1) #wrapped in @() so += appends to the array instead of doing arithmetic

#Declaring for loop to iterate over the array
For ($i=0; $i -le $arr.Length - 1; $i++) {
    $percent = $i / $arr.Length * 100
    Write-Progress -Activity "ActivityString" -Status "StatusString" -PercentComplete $percent -CurrentOperation "CurrentOperationString"
    
    $trigger = "f"
    
    For ($j = $i + 1; $j -le $arr.Length - 1; $j++)
    {
        foreach ($num in $arrDupNum)
        {
            #if statement to skip over duplicates already found
            if($num -eq $j -and $j -le $arr.Length - 2)
            {
                $j = $j + 1
            }            
        }

        if ($arr[$j].Name -eq $arr[$i].Name)
        {
            $trigger = "t"
            Add-Content H:\Desktop\blank.txt ($arr[$j].FullName + "; " + $arr[$i].FullName)
            Write-Host $arr[$i].Name
            $arrDupNum += $j
        }
    }
    #trigger used for formatting the text file in csv format
    if ($trigger -eq "t")
    {
    Add-Content H:\Desktop\blank.txt (" " + "; " + " ")
    }
}
3 Comments
  • How many files and directories are in "My Path"? Is that on a local disk or a network share? Is H:\Desktop\blank.txt on a local disk or a network share? Is it possible it's slow because...enumerating deep directories can be slow? Commented Aug 30, 2019 at 15:45
  • Over a million files. Yes, it is a network share. The H:\Desktop file is local. Yes, it will be slow no matter what, but I guess what I am wondering is whether there is a faster way to find the duplicate files, as far as my code goes. Commented Aug 30, 2019 at 15:55
  • I see. There are improvements that could be made to your code, I was just trying to assess how slow is "slow" and how you know it's your code that is "inefficient and slow" since the question didn't describe that. By the way, by immediately accepting the first/only answer only 15 minutes after the question was asked, be aware that you are potentially discouraging others from answering if they think your problem is completely solved. Commented Aug 30, 2019 at 16:15

2 Answers


Use a hashtable to group the files by name:

$filesByName = @{}

foreach($file in $arr){
    $filesByName[$file.Name] += @($file)
}

Now we just need to find all hashtable entries with more than one file:

foreach($fileName in $filesByName.Keys){
    if($filesByName[$fileName].Count -gt 1){
        # Duplicates found!
        $filesByName[$fileName] |Select-Object -ExpandProperty FullName |Add-Content .\duplicates.txt
    }
}

This way, when you have N files, you'll iterate over them at most N*2 times, instead of N*N times :)
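For anyone who wants to run this end-to-end, here are the two snippets above stitched into one self-contained script; the source directory and output file below are placeholders to replace with your own paths:

# Placeholder paths - replace with your own source directory and output file
$sourcePath = 'My Path'
$outputFile = '.\duplicates.txt'

# Enumerate files only (-File requires PowerShell 3.0 or later)
$arr = Get-ChildItem -Path $sourcePath -Recurse -File

# Group the files by name: key = file name, value = array of matching files
$filesByName = @{}
foreach($file in $arr){
    $filesByName[$file.Name] += @($file)
}

# Any name that maps to more than one file is a set of duplicates
foreach($fileName in $filesByName.Keys){
    if($filesByName[$fileName].Count -gt 1){
        $filesByName[$fileName] |Select-Object -ExpandProperty FullName |Add-Content $outputFile
    }
}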


3 Comments

@bman, since you are searching for duplicate files, I recommend adding another condition to the hashtable key, like creation date or last modified date, depending on your needs (see the sketch after these comments).
I just tried this and it doesn't work. It doesn't even create the file duplicates.txt. Even if I create the file myself, it isn't modified in any way. I created a PowerShell script with the code from the code blocks and it doesn't do anything. What's the $arr variable? Is it supposed to be filled in by me? The script is incomplete.
Ok. I added $arr = (Get-ChildItem -Path ".\" -Recurse) at the start of the script and now it outputs all the duplicate files to the file.
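Building on the comment above about adding another condition: a minimal sketch, assuming the same $arr as in the question, that makes the hashtable key a composite of file name and size, so only files matching on both are treated as duplicates (swap in $file.LastWriteTime or a Get-FileHash digest for stricter matching):

$filesByKey = @{}
foreach($file in $arr){
    # Composite key: same name AND same length counts as a duplicate.
    # For stricter matching, append $file.LastWriteTime or a content hash.
    $key = '{0}|{1}' -f $file.Name, $file.Length
    $filesByKey[$key] += @($file)
}

The rest of the answer works unchanged; just loop over $filesByKey instead of $filesByName.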

The other answer tackles the most significant improvement you can make, but there are a couple of other tweaks that might improve performance.

When you use Where-Object to filter by the Extension property, that filtering is done in PowerShell itself. For a simple pattern like you're using, you can have a lower-level API do the filtering using the -Filter parameter of Get-ChildItem...

$arr = (get-childitem "My Path" -recurse -Filter '*.*')

That pattern, of course, specifically filters for entries whose names contain a '.'. If you meant it as a DOS-style "all files" pattern, you could use '*' or, better yet, just omit the filter entirely. On the subject of "all files", it's important to point out that Get-ChildItem does not include hidden files by default. To include those in your search, use the -Force parameter...

$arr = (get-childitem "My Path" -recurse -Filter '*.*' -Force)

Also, be aware that Get-ChildItem will return both file and directory objects from a filesystem. That is, the code in the question will look at directory names, too, in its search for duplicates. If, as the question suggests, you want to restrict it to files you can use the -File parameter of Get-ChildItem...

$arr = (get-childitem "My Path" -recurse -Filter '*.*' -File)

Note that that parameter first became available in PowerShell 3.0, but as that version is several releases old by now, I'm sure it will work for you.
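If you want to confirm which version you're running before relying on -File, you can check the built-in $PSVersionTable variable:

# Prints the engine version; -File needs 3.0 or later
$PSVersionTable.PSVersion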
