
I have a PowerShell script which reads a 4,000 KB text file (approximately 88,500 lines). This is the first time I have had my code do this much work. The script below took over 2 minutes to run and consumed around 20% CPU. Can I improve performance with different code choices?

# extractUniqueBaseNames.ps1    --- copy first UPPERCASE word in each line of text, remove duplicates & store

$listing = 'C:\roll-conversion dump\LINZ Place Street Index\StreetIndexOutput.txt'

[array]$tempStorage = $null
[array]$Storage = $null

# select only the CAPITALISED first string (at least two chars) from listings
Select-String -Pattern '(\b[A-Z]{2,}\b[$\s])' -Path $listing -CaseSensitive |
    ForEach-Object {$newStringValue = $_.Matches.Value -replace '$\s', '\n' 
                    $tempStorage += $newStringValue 
                    }

    $Storage += $tempStorage | Select-Object -Unique

I have also added the following line to output the results to a new text file (this was not included in the timed run above):

$Storage | Out-File -Append atest.txt

Since I am at an early stage of my development, I would appreciate any suggestions that would improve the performance of this kind of PowerShell script.
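(A quick first comparison point, before reaching for .NET APIs: the main cost in the script above is $tempStorage += inside ForEach-Object, because += recreates the whole array on every append. A minimal sketch that lets the pipeline collect the matches instead, assuming the same $listing path and simplifying the pattern to the bare uppercase word, would be:)

# take only the matched words, then drop duplicate base names;
# the pipeline collects the results itself, so no += array copies
$Storage = Select-String -Pattern '\b[A-Z]{2,}\b' -Path $listing -CaseSensitive |
    ForEach-Object { $_.Matches.Value } |
    Select-Object -Unique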

3 Comments

  • Can you explain why you do this at the end of your script: $Storage += $tempStorage?
  • As I understand it, Santiago, $tempStorage contains the matches of the regex. These UPPERCASE 'words' are the base names of streets, i.e. just the identifying name. Because towns have some street names in common (e.g. Main Street), they get duplicated. $Storage stores the data piped through Select-Object -Unique to remove the duplicates, so that it holds only the unique street base names (see the short example after these comments).
  • I see; I updated the code. I believe it should do what you're after, just faster and more efficient.
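(To illustrate the deduplication step described above, here is a tiny example with made-up street base names; Select-Object -Unique keeps the first occurrence of each value:)

'MAIN', 'HIGH', 'MAIN', 'QUEEN', 'HIGH' | Select-Object -Unique
# outputs: MAIN, HIGH, QUEEN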

1 Answer


If I understand your code correctly, this should do the same thing, but faster and more efficiently.

Reference documentation for the .NET APIs used below: Regex, StreamReader, File.Open, StreamWriter, and HashSet<T>.

using namespace System.IO
using namespace System.Collections.Generic

try {
    $re      = [regex] '(\b[A-Z]{2,}\b[$\s])'
    $reader  = [StreamReader] 'some\path\to\inputfile.txt'
    $stream  = [File]::Open('some\path\to\outputfile.txt', [FileMode]::Append, [FileAccess]::Write)
    $writer  = [StreamWriter]::new($stream)
    $storage = [HashSet[string]]::new()

    while(-not $reader.EndOfStream) {
        $match = $re.Match($reader.ReadLine())
        # .Success is needed here: Regex.Match always returns a Match
        # object, and even a failed match is truthy in PowerShell
        if($match.Success) {
            # strip the trailing delimiter captured by [$\s]
            $line = $match.Value.Trim()
            # .Add returns $false for values already in the set,
            # so each base name is written out only once
            if($storage.Add($line)) {
                $writer.WriteLine($line)
            }
        }
    }
}
finally {
    ($reader, $writer, $stream).ForEach('Dispose')
}
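The speed-up comes from three choices: the StreamReader reads one line at a time instead of buffering pipeline objects; the HashSet replaces Select-Object -Unique, since its .Add method is a constant-time lookup that returns $false for values it has already seen; and writing each result as soon as it is found avoids the += pattern from the original script, which copies the entire array on every append. To try it, substitute your real input and output paths for the two placeholders.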

2 Comments

Success! Yes, lightning fast compared to the previous version. The new session did it. Thank you. I will work on 'getting my head around' your code.
@Dave, glad it worked. Since it uses almost entirely .NET APIs, the code is expectedly more complicated. You asked for performance, though, and the built-in PowerShell cmdlets are usually not the fastest option.
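(For comparison, a middle ground that stays within built-in language features is switch -Regex -File, which also streams the file line by line. This is an untested sketch; the placeholder paths mirror the answer's, and the line-anchored pattern assumes the base name is the first word on each line:)

# untested sketch: stream the file with switch -Regex and deduplicate
# with the same HashSet technique as the answer above
$seen   = [System.Collections.Generic.HashSet[string]]::new()
$unique = switch -Regex -CaseSensitive -File 'some\path\to\inputfile.txt' {
    '^([A-Z]{2,})\b' {
        # $Matches is populated by switch -Regex; emit first sightings only
        if ($seen.Add($Matches[1])) { $Matches[1] }
    }
}
$unique | Set-Content 'some\path\to\outputfile.txt'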
