Extract URL from text file then parse using PowerShell

Question

I am trying to extract a URL from a text file. I am using PowerShell to do this. The last part of the URL will be different each time. A snippet of the file is as follows:

<table class="button" style="border-collapse: collapse; border-spacing: 0; overflow: 
hidden; padding: 0; text-align: left; vertical-align: top; width: 100%;"><tbody>
<tr style="padding: 0; text-align: left; vertical-align: top;"><td style="-moz-hyphens: none; 
-webkit-hyphens: none; -webkit-text-size-adjust: none; background: #049FD9; 
border: none; border-collapse: collapse !important; border-radius: 2px; color: #fff; display: block; font-family: 'Helvetica-Light','Arial',sans-serif; font-size: 14px; font-weight: lighter; hyphens: none; line-height:19px; margin: 0; padding: 8px 16px; text-align: center; vertical-align: top; width: auto 
!important; word-break: keep-all;">
<a href="https://www.website.com:443/idb/setPassword?t=BcHJEoIgAADQD%2BKQjqZ4VEKtBHLJJm82uWDuxCR%2Bfe%2B58Rl9HRz6QddWkO5MLDXuF6e9m%2Bo0z%2FCVS%2B9IenAp5m5yTfYRa%2BAn4jdWHHF7HTyqRZiRRiNDEE%2BK7ZJywLKeNCTj4ewu4QNu02qXB0ZTXTyxXADwaLeluZGVPCxGXunpVcHbiCVAWRR7ykqGensLVBsqNUpl%2FQE%3D" 
style="-webkit-text-size-adjust: none; font-weight: 100; color: #fff; font-family: 'Helvetica-Light','Arial',sans-serif; font-size: 20px; font-weight: lighter; line-height: 32px; text-decoration: none;">Get Started</a> </td></tr></tbody></table></td>

I want to extract the URL that starts with:

https://www.website.com:443/idb/setPassword

The string after the t= will be different each time. How can I extract the entire URL into a variable that I can then parse to get the info I need, which is the string of characters after the ?t=?

Richard · Accepted Answer · 2017-05-02 11:53:43Z

3

Try the following:

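# Read the file (replace the path with your own file)
$content = Get-Content -Path 'C:\test.txt'
# [^"]+ stops at the closing quote, so the match cannot run past the end of the URL
[regex]$regex = '(?<=href="https:\/\/www\.website\.com:443\/idb\/setPassword\?t=)[^"]+(?=")'
$regex.Matches($content).Value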

In $content, replace the path with that of the text file containing the URL, and update $regex with the correct URL for your site.

This method uses a regex lookbehind (?<= ) to match the fixed part of the URL before the token and a lookahead (?= ) to match the text after it, and returns only the text in between.
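
As a minimal sketch of how the lookbehind and lookahead behave, the following runs the same idea against an inline string instead of a file. The $sample value is a hypothetical, shortened stand-in for the real line, and this sketch uses [^"]+ for the middle part so the match cannot extend past the closing quote.

```powershell
# Hypothetical one-line sample standing in for the file contents (token shortened).
$sample = '<a href="https://www.website.com:443/idb/setPassword?t=BcHJEoIgAADQD%2BKQ%3D" style="color: #fff;">Get Started</a>'

# (?<=...) must match immediately before the token and (?=") immediately after it;
# neither is included in the returned value.
$pattern = '(?<=href="https://www\.website\.com:443/idb/setPassword\?t=)[^"]+(?=")'

$token = [regex]::Match($sample, $pattern).Value
$token
```

Because both lookarounds are zero-width, $token holds only the characters after ?t= and before the closing quote.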

edited May 2, 2017 at 11:53

answered May 2, 2017 at 11:48

Richard


1 Comment

Jason Murray Over a year ago

It always seems to happen, but shortly after I posted this I figured out my issue: the output step was formatting the text document and word-wrapping the URL with spaces, so it never matched. I ended up using the -Width 999999 parameter to make sure the URL stayed on one line. After figuring that out, I used a regex similar to the one you posted to pull that line out and match the string of characters I needed. Thanks for your answer.

Mark Wragg · Accepted Answer · 2017-05-02 12:03:11Z

3

Here is a solution that uses a combination of Select-String with a regular expression to get the URL and the [system.uri] class to interrogate it.

$Text = Get-Content 'html-sample.txt'
$URLString = (Select-String '(http[s]?)(:\/\/)([^\s,]+)(?=")' -InputObject $Text).Matches.Value

# At this point $URLString is a string with just the URL and query string, as requested
$URLString

# Here's how you might interrogate it
[System.Uri]$URL = $URLString
$Token = ($URL.Query -split '=')[1]
$URL.Host
$Token

Explanation:

Uses the regular expression (http[s]?)(:\/\/)([^\s,]+)(?=") with Select-String to extract the URL. Note that this only gets the first match by default. If you need to match multiple URLs, use the -AllMatches switch of Select-String and then handle each result in a ForEach loop.
Uses [system.uri] to cast the URL string to a URI object.
Accesses the Host property of the object to return the host name (www.website.com in the sample).
Accesses the Query property of the object to return the query string, then splits it on '=' and takes the second element, which is the token that follows '?t='.
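
The multiple-URL case mentioned above might look roughly like this. It is only a sketch: the $text sample and its shortened tokens are hypothetical, the pattern is a simplified one rather than the exact regex from the answer, [pscustomobject] assumes PowerShell 3.0 or later, and the Decoded column (via [System.Uri]::UnescapeDataString) is an optional extra that the question did not ask for.

```powershell
# Hypothetical sample text with two links (tokens shortened for readability).
$text = @'
<a href="https://www.website.com:443/idb/setPassword?t=AAA%2B111%3D" style="color: #fff;">Get Started</a>
<a href="https://www.website.com:443/idb/setPassword?t=BBB%2F222%3D" style="color: #fff;">Get Started</a>
'@

# -AllMatches returns every URL found in the input, not only the first one.
$found = Select-String -InputObject $text -Pattern 'https?://[^\s,"]+' -AllMatches

$results = foreach ($m in $found.Matches) {
    [System.Uri]$uri = $m.Value
    # Query looks like '?t=<token>'; strip the leading '?t=' to keep the token.
    $raw = $uri.Query -replace '^\?t=', ''
    [pscustomobject]@{
        Host    = $uri.Host
        Token   = $raw
        Decoded = [System.Uri]::UnescapeDataString($raw)
    }
}
$results
```

Anchoring the -replace pattern with ^ means only a leading ?t= is removed, so a token that happens to contain the same characters later on is left alone.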

edited May 2, 2017 at 12:03

answered May 2, 2017 at 11:48

Mark Wragg


2 Comments

Jason Murray Over a year ago

Thanks, Mark, for your answer and explanations. I ended up using the Matches approach from the answer above, but I tried this out and it works as well. Thanks!

Mark Wragg Over a year ago

I just fixed a small bug in it, as I noticed the regex was also grabbing the double quote at the end of the URL, which was then being encoded as %22 on the end of the token. Added a -replace to strip out any double quotes.

Ricc Babbitt · Accepted Answer · 2017-05-03 08:15:54Z

2

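Here's another way: cast the file contents to [xml] so the file is read as an XmlDocument. Note that the cast only succeeds if the file is well-formed XML.

$thisxml = [xml](Get-Content .\hypertext.html)

Then drill down to the node you want using XPath:

$thisxpath = $thisxml.SelectNodes("//table//tr//td//a").href

Then cast to [System.Uri] to parse the URL and select the pieces you want. The example below keeps the scheme, host, and path; the Query property holds the ?t=... part.

$thisuri = [System.Uri]$thisxpath | ForEach-Object { $_.Scheme + "://" + $_.Host + $_.LocalPath }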
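
A self-contained sketch of the same XML route, extended to pull out the token the question asks for, could look like this. The inline fragment is a hypothetical, well-formed stand-in for the file, and the variable names are illustrative only.

```powershell
# Hypothetical, well-formed fragment; the [xml] cast throws on markup that is not valid XML.
[xml]$doc = @'
<table><tbody><tr><td>
<a href="https://www.website.com:443/idb/setPassword?t=AAA%2B111%3D" style="color: #fff;">Get Started</a>
</td></tr></tbody></table>
'@

# Select the first <a> element under a table cell and read its href attribute.
$href = $doc.SelectSingleNode('//table//td//a').GetAttribute('href')

[System.Uri]$uri = $href
$base  = $uri.Scheme + '://' + $uri.Host + $uri.LocalPath   # URL without port or query
$token = $uri.Query -replace '^\?t=', ''                    # text after '?t='
$base
$token
```

Real-world HTML is often not valid XML (unclosed tags, stray closing tags), in which case the regex-based answers are the more forgiving option.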

answered May 3, 2017 at 8:15

Ricc Babbitt

