Given the following string:

'<p><a href="china">China</a><br><a href="india">India</a><br><a
href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a
href="thailand">Thailand</a></p>'

I'd like to use PowerShell to extract all of the countries listed in it. In other words, I want to return @(China,India,Korea,Malaysia,Thailand).

I have tried using regex but can't find the right pattern. For example:

'<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'  -match '(<a href="[A-Z a-z]*">[A-Z a-z]*</a>)+'
$matches

Which returns:

Name                           Value
----                           -----
1                              <a href="china">China</a>
0                              <a href="china">China</a>

Any suggestions? Is regex the right approach here?

P.S. Note that the snippet is not well-formed so I can't simply convert it to XML.

4 Answers

3

The $Matches automatic variable contains the capturing groups of the first match found by the last -match operation, not a list of every match in the input. To get all matches of a pattern, use the Matches method of the [Regex] class:

$InputString='<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'
$Pattern='<a href="[A-Z a-z]*">([A-Z a-z]*)</a>'
$Countries=[Regex]::Matches($InputString,$Pattern)|ForEach-Object {$_.Groups[1].Value}
$Countries
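
As an alternative sketch (not part of the original answer), the same result can probably be obtained with the Select-String cmdlet and its -AllMatches switch, which exposes every match through the Matches property. The variable names below are reused from the snippet above for illustration only:

```powershell
# Same input and pattern as above; group 1 captures the link text.
$InputString = '<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'
$Pattern = '<a href="[A-Z a-z]*">([A-Z a-z]*)</a>'

# -AllMatches makes Select-String record every match on the line, not just the first.
$Countries = $InputString |
    Select-String -Pattern $Pattern -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Groups[1].Value }

$Countries
```

This stays within cmdlets rather than calling the .NET class directly. Which form to use is mostly a matter of taste.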

That said, for parsing HTML you are better off using an HTML parser, as another answer suggests.

1

Regular expressions are never a good way to handle HTML (though often they are tempting). You can parse the HTML and extract the data you want without using any regex:

PS C:\> $d = '<p><a href="china">China</a><br><a href="india">India</a><br><a
href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a
href="thailand">Thailand</a></p>'


PS C:\> $html = New-Object -ComObject "HTMLFile"

PS C:\> $html.IHTMLDocument2_write($d)

PS C:\> $html.getElementsByTagName('A') | select -expandProperty innerText
China
India
Korea
Malaysia
Thailand

2 Comments

Nice. Didn't know about expandProperty. Thanks Duncan.
With PowerShell 3 and later you mostly don't need -ExpandProperty, as you can usually just use dot notation. I don't know why it doesn't work here: ($html.getElementsByTagName('A')).innerText gives nothing, while $html.getElementsByTagName('A') | select -expandProperty innerText works fine. I guess it must be because $html is a COM object.

0

The following Regex should do the trick:

(?<=><a\shref="\w+">)\w+
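
The answer gives only the pattern, so here is a minimal sketch of how it might be applied, assuming the single-line version of the input string from the question and the [Regex]::Matches method. The variable names are illustrative:

```powershell
$InputString = '<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'

# The lookbehind asserts that the text is preceded by '><a href="...">'
# without including that prefix in the match itself.
$Pattern = '(?<=><a\shref="\w+">)\w+'

# Collect the matched text of every occurrence.
$Countries = [Regex]::Matches($InputString, $Pattern) | ForEach-Object { $_.Value }

$Countries
```

Because \w does not match spaces, this sketch would cut off a country name that contains a space. The sample data has none.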

0

$InputString='<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'
$Pattern='(?<=>)\w+?(?=<)'

([Regex]::Matches($InputString,$Pattern)).Value

China
India
Korea
Malaysia
Thailand
