I'm stuck on this. I want to get all the URLs in the text that match my pattern. Each result should include the first parameter of the URL, but not the second one.

Two issues:

  1. It's not getting the first URL
  2. I'm missing how the capture works.

In Method 1, I see the matches, but I don't see the captured text for what I put in parentheses. In Method 2, I see my captures in some of the output, but I also get extra output that contains more than my capture. I prefer the style of Method 2; I wrote Method 1 to try to understand what's happening, but just dug myself a deeper hole.

$fileContents = 'Misc Text < a href="http://example.com/Test.aspx?u=a1">blah blah</a>  More Stuff <a href="http://example.com/Test.aspx?u=b2&parm=123">blah blah </a> Closing Text'


#Sample URL           http://example.com/Test.aspx?u=a1&parm=123 
$pattern = '<a href="(http://example.com/Test.aspx\?u=.*?)[&"]'
Write-Host "RegEx Pattern=$pattern"

Write-Host "----------- Method 1 --------------"  
$groups = [regex]::Matches($fileContents, $pattern)
$groupnum = 0 
foreach ($group in $groups)  
{
    Write-Host "Group=$groupnum URL=$group " 
    $capturenum = 0 
    foreach ($capture in $group.Captures) 
    {
        Write-Host "Group=$groupnum Capture=$capturenum URL=$capture.value index=$($capture.index)" 
        $capturenum = $capturenum + 1 
    }
    $groupnum = $groupnum + 1 
}

Write-Host "----------- Method 2 --------------"  
$urls = [regex]::Matches($fileContents, $pattern).Groups.Captures.Value 
#$urls = $urls | select -Unique

Write-Host "Number of Matches = $($urls.Count)"


foreach ($url in $urls) 
    {
    Write-Host "URL: $url "
    }

Write-Host " " 

Output:

----------- Method 1 --------------
Group=0 URL=<a href="http://example.com/Test.aspx?u=b2& 
Group=0 Capture=0 URL=<a href="http://example.com/Test.aspx?u=b2&.value index=81
----------- Method 2 --------------
Number of Matches = 2
URL: <a href="http://example.com/Test.aspx?u=b2& 
URL: http://example.com/Test.aspx?u=b2 

PowerShell version 5.1.17763.592

4 Comments
  • Select-String -Pattern '(?<=a href=")[^"]*' -AllMatches Commented Jul 19, 2019 at 14:14
  • The first URL is not matched because you have an extra space between < and a. Commented Jul 19, 2019 at 14:30
  • @AnsgarWiechers I like using the native way, but still cannot get it to work: $urls = Select-String -InputObject $fileContents -Pattern '(?<=a href=")[^"]*' -AllMatches Commented Jul 19, 2019 at 14:43
  • You need to expand the value of the matches produced by Select-String. Commented Jul 19, 2019 at 14:45
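
To illustrate the comment about the extra space, here is a minimal sketch that keeps the sample text from the question unchanged and instead loosens the pattern. Allowing whitespace after `<` and escaping the dots are assumptions added for this example, not part of the original pattern.

```powershell
# Same sample text as in the question, including the stray space in "< a".
$fileContents = 'Misc Text < a href="http://example.com/Test.aspx?u=a1">blah blah</a>  More Stuff <a href="http://example.com/Test.aspx?u=b2&parm=123">blah blah </a> Closing Text'

# Assumption: optional whitespace after "<" is acceptable for this input.
# The dots are escaped so they only match a literal ".".
$pattern = '<\s*a\s+href="(http://example\.com/Test\.aspx\?u=.*?)[&"]'

$urls = foreach ($m in [regex]::Matches($fileContents, $pattern)) {
    $m.Groups[1].Value   # group 1 = the text inside the parentheses
}

$urls
# http://example.com/Test.aspx?u=a1
# http://example.com/Test.aspx?u=b2
```

Removing the stray space from the input would have the same effect with the original pattern.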

2 Answers

I'm missing how the capture works.

Group 0 is always the entire match. Unnamed capture groups are numbered from 1 upward, in the order of their opening parentheses, so you'll want group 1.

I've renamed the variables to make their meaning a little more clear:

$MatchList = [regex]::Matches($fileContents, $pattern)

foreach($Match in $MatchList){
  for($i = 0; $i -lt $Match.Groups.Count; $i++){
    "Group $i is: $($Match.Groups[$i].Value)"
  }
}
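
As a side note on why Method 1 in the question never showed the parenthesized text, here is a small sketch using the question's own text and pattern. It uses `[regex]::Match()` (first match only) purely to keep the example short; that choice is mine, not the question's.

```powershell
$fileContents = 'Misc Text < a href="http://example.com/Test.aspx?u=a1">blah blah</a>  More Stuff <a href="http://example.com/Test.aspx?u=b2&parm=123">blah blah </a> Closing Text'
$pattern = '<a href="(http://example.com/Test.aspx\?u=.*?)[&"]'

$Match = [regex]::Match($fileContents, $pattern)   # first match only

# In a double-quoted string, "$Match.Value" expands $Match and then appends
# the literal text ".Value". Wrap property access in $( ) instead.
"Whole match : $($Match.Value)"
"Group 1     : $($Match.Groups[1].Value)"

# $Match.Captures holds the captures of group 0, i.e. the whole match again,
# so looping over it does not reach the text inside the parentheses.
"Captures[0] : $($Match.Captures[0].Value)"
```

The text inside the parentheses is only reachable through `Groups[1]`; the `Captures` collection of the match itself repeats the full match.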

If you want to collect all the captured URLs, just do:

$urls = foreach($Match in $MatchList){
  $Match.Groups[1].Value   # group 1 is the parenthesized part of the pattern
}

If you only need the first match you don't need to invoke [regex]::Matches() manually though - PowerShell will automatically inject the string value of any captured groups into the automatic $Matches variable when you use the -match operator, so if you do:

if($fileContents -match $pattern){
    "Group 1 is $($Matches[1])"
}
# or
if($fileContents -match $pattern){
    $url = $Matches[1]
}

... you'll get the expected result:

Group 1 is http://example.com/Test.aspx?u=b2

3 Comments

Thanks! Is there a shortcut to do something like this: $urls = $MatchList.Groups[1].Value. Also did you mean to leave the $groups in $matchList = $groups = [regex]... or was that just a copy/paste typo.
@NealWalters That was a copy-pasta error :) updated answer
Also, in your section on getting all captured URLs, this is all that is needed (it just needs the subscript 1): $urls = foreach($Match in $MatchList) { $Match.Groups[1].Value }
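
On the shortcut asked about in the comments: as far as I can tell, `$MatchList.Groups[1].Value` does not return group 1 of every match, because member enumeration gathers the groups of all matches into one list before `[1]` is applied. A hedged sketch of a pipeline alternative follows; the sample text here has the stray space removed so that both links match, which is an assumption made for the demonstration.

```powershell
# Assumption: sample text with the space in "< a" removed, so there are two matches.
$fileContents = 'Misc Text <a href="http://example.com/Test.aspx?u=a1">blah blah</a>  More Stuff <a href="http://example.com/Test.aspx?u=b2&parm=123">blah blah </a> Closing Text'
$pattern = '<a href="(http://example.com/Test.aspx\?u=.*?)[&"]'

$MatchList = [regex]::Matches($fileContents, $pattern)

# Index into each match separately, then take its group 1.
$urls = $MatchList | ForEach-Object { $_.Groups[1].Value }

$urls
# http://example.com/Test.aspx?u=a1
# http://example.com/Test.aspx?u=b2
```

This is equivalent to the `foreach` loop in the answer, just written as a one-line pipeline.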
1

Use Select-String with the parameter -AllMatches to get all matches from your input string. Your regular expression should look like this: (?<=a href=")[^"]*. That will match any character that is not a double quote after the string a href=" (with that last string not being included in the match). Now you just need to expand the value of the matches and you're done.

$re = '(?<=a href=")[^"]*'
$fileContents |
    Select-String -Pattern $re -AllMatches |
    Select-Object -Expand Matches |
    Select-Object -Expand Value

6 Comments

Nice, that syntax has always confused me. Your pattern returns the entire URL, I didn't want the &parm=123. When I substitute my $pattern it's returning too much.
@NealWalters You can simply remove the trailing parameters by running -replace '&.*' (or -replace '\?.*' if you want to remove the entire parameter list) on the result. It's easier to extract the whole URLs first and trim them later.
I was trying to use a capture of everything in (). Might be good for future applications as well. You have () as well - so is yours not doing the capture of only what is between the parentheses? I guess my point is that my RegEx was correct, and you seem to be solving a different problem.
@NealWalters What I have in my answer is (as already mentioned) a positive lookbehind assertion: (?<=...). Lookaround assertions allow matching parts of a string without including those parts in the match that is returned. Essentially, the code in my answer returns the URL (and only the URL) as the full match, so you don't need to fiddle around with capturing groups.
Thanks, but I didn't want the whole URL - only up to the end of the first parameter.
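
For what it's worth, the two approaches in this thread can be combined. The sketch below keeps `Select-String -AllMatches` but reads group 1 of each match instead of the full match value. The `\s*` after `<` is an assumption added so the first link (with its stray space) is also found; `$lookbehind` is a hypothetical variable name used only here.

```powershell
$fileContents = 'Misc Text < a href="http://example.com/Test.aspx?u=a1">blah blah</a>  More Stuff <a href="http://example.com/Test.aspx?u=b2&parm=123">blah blah </a> Closing Text'

# The question's pattern, with optional whitespace after "<" (assumption).
$pattern = '<\s*a\s+href="(http://example.com/Test.aspx\?u=.*?)[&"]'

$urls = $fileContents |
    Select-String -Pattern $pattern -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Groups[1].Value }   # capture group, not the whole match

# Variation on the lookbehind pattern: also stop at the first "&",
# so the full match is already the trimmed URL.
$lookbehind = '(?<=a href=")[^"&]*'
$urls2 = $fileContents |
    Select-String -Pattern $lookbehind -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value }
```

With the sample text, both `$urls` and `$urls2` should contain the two URLs ending in `u=a1` and `u=b2`. The lookbehind variant accepts any `href`, not only those pointing at `Test.aspx`.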