
I have a PowerShell script which reads a 4,000 KB text file (approximately 88,500 lines). This is the first time I have had my code do this much work. The script below took over 2 minutes to run and consumed around 20% CPU. Can I improve performance with different code choices?

# extractUniqueBaseNames.ps1    --- copy first UPPERCASE word in each line of text, remove duplicates & store

$listing = 'C:\roll-conversion dump\LINZ Place Street Index\StreetIndexOutput.txt'

[array]$tempStorage = $null
[array]$Storage = $null

# select only the CAPITALISED first string (at least two chars) from listings
Select-String -Pattern '(\b[A-Z]{2,}\b[$\s])' -Path $listing -CaseSensitive |
    ForEach-Object {$newStringValue = $_.Matches.Value -replace '$\s', '\n' 
                    $tempStorage += $newStringValue 
                    }

    $Storage += $tempStorage | Select-Object -Unique

I have also added the following line to output the results to a new text file (this was not included in the timed run above):

$Storage | Out-File -Append atest.txt

Since I am at an early stage of my development, I would appreciate any suggestions that would improve the performance of this kind of PowerShell script.
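(A quick first comparison point, before reaching for .NET APIs: the main cost in the script above is $tempStorage += inside ForEach-Object, because += recreates the whole array on every append. A minimal sketch that lets the pipeline collect the matches instead, assuming the same $listing path and simplifying the pattern to the bare uppercase word, would be:)

# take only the matched words, then drop duplicate base names;
# the pipeline collects the results itself, so no += array copies
$Storage = Select-String -Pattern '\b[A-Z]{2,}\b' -Path $listing -CaseSensitive |
    ForEach-Object { $_.Matches.Value } |
    Select-Object -Unique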

3 Comments

  • Can you explain why you do this at the end of your script: $Storage += $tempStorage?
  • As I understand it, Santiago, $tempStorage contains the matches of the regex. These UPPERCASE 'words' are the base names of streets, i.e. just the identifying name. Because towns have some street names in common (e.g. Main Street), they get duplicated. $Storage stores the data piped through Select-Object -Unique to remove the duplicates, so that it holds only the unique street base names (see the short example after these comments).
  • I see; I updated the code. I believe it should do what you're after, just faster and more efficient.
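(To illustrate the deduplication step described above, here is a tiny example with made-up street base names; Select-Object -Unique keeps the first occurrence of each value:)

'MAIN', 'HIGH', 'MAIN', 'QUEEN', 'HIGH' | Select-Object -Unique
# outputs: MAIN, HIGH, QUEEN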

1 Answer


If I understand your code correctly, this should do the same thing, but faster and more efficiently.

Reference documentation for the .NET APIs used below: Regex, StreamReader, File.Open, StreamWriter, and HashSet<T>.

using namespace System.IO
using namespace System.Collections.Generic

try {
    $re      = [regex] '(\b[A-Z]{2,}\b[$\s])'
    $reader  = [StreamReader] 'some\path\to\inputfile.txt'
    $stream  = [File]::Open('some\path\to\outputfile.txt', [FileMode]::Append, [FileAccess]::Write)
    $writer  = [StreamWriter]::new($stream)
    $storage = [HashSet[string]]::new()

    while(-not $reader.EndOfStream) {
        $match = $re.Match($reader.ReadLine())
        # .Success is needed here: Regex.Match always returns a Match
        # object, and even a failed match is truthy in PowerShell
        if($match.Success) {
            # strip the trailing delimiter captured by [$\s]
            $line = $match.Value.Trim()
            # .Add returns $false for values already in the set,
            # so each base name is written out only once
            if($storage.Add($line)) {
                $writer.WriteLine($line)
            }
        }
    }
}
finally {
    ($reader, $writer, $stream).ForEach('Dispose')
}
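The speed-up comes from three choices: the StreamReader reads one line at a time instead of buffering pipeline objects; the HashSet replaces Select-Object -Unique, since its .Add method is a constant-time lookup that returns $false for values it has already seen; and writing each result as soon as it is found avoids the += pattern from the original script, which copies the entire array on every append. To try it, substitute your real input and output paths for the two placeholders.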

2 Comments

Success! Yes, lightning fast compared to the previous version. The new session did it. Thank you. I will work on 'getting my head around' your code.
@Dave, glad it worked. Since it uses almost entirely .NET APIs, the code is expectedly more complicated. You asked for performance, though, and the built-in PowerShell cmdlets are usually not the fastest option.
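(For comparison, a middle ground that stays within built-in language features is switch -Regex -File, which also streams the file line by line. This is an untested sketch; the placeholder paths mirror the answer's, and the line-anchored pattern assumes the base name is the first word on each line:)

# untested sketch: stream the file with switch -Regex and deduplicate
# with the same HashSet technique as the answer above
$seen   = [System.Collections.Generic.HashSet[string]]::new()
$unique = switch -Regex -CaseSensitive -File 'some\path\to\inputfile.txt' {
    '^([A-Z]{2,})\b' {
        # $Matches is populated by switch -Regex; emit first sightings only
        if ($seen.Add($Matches[1])) { $Matches[1] }
    }
}
$unique | Set-Content 'some\path\to\outputfile.txt'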
