I am trying to extract a URL from a text file. I am using PowerShell to do this. The last part of the URL will be different each time. A snippet of the file is as follows:

<table class="button" style="border-collapse: collapse; border-spacing: 0; overflow: 
hidden; padding: 0; text-align: left; vertical-align: top; width: 100%;"><tbody>
<tr style="padding: 0; text-align: left; vertical-align: top;"><td style="-moz-hyphens: none; 
-webkit-hyphens: none; -webkit-text-size-adjust: none; background: #049FD9; 
border: none; border-collapse: collapse !important; border-radius: 2px; color: #fff; display: block; font-family: 'Helvetica-Light','Arial',sans-serif; font-size: 14px; font-weight: lighter; hyphens: none; line-height:19px; margin: 0; padding: 8px 16px; text-align: center; vertical-align: top; width: auto 
!important; word-break: keep-all;">
<a href="https://www.website.com:443/idb/setPassword?t=BcHJEoIgAADQD%2BKQjqZ4VEKtBHLJJm82uWDuxCR%2Bfe%2B58Rl9HRz6QddWkO5MLDXuF6e9m%2Bo0z%2FCVS%2B9IenAp5m5yTfYRa%2BAn4jdWHHF7HTyqRZiRRiNDEE%2BK7ZJywLKeNCTj4ewu4QNu02qXB0ZTXTyxXADwaLeluZGVPCxGXunpVcHbiCVAWRR7ykqGensLVBsqNUpl%2FQE%3D" 
style="-webkit-text-size-adjust: none; font-weight: 100; color: #fff; font-family: 'Helvetica-Light','Arial',sans-serif; font-size: 20px; font-weight: lighter; line-height: 32px; text-decoration: none;">Get Started</a> </td></tr></tbody></table></td>

I want to extract the URL that starts with:

https://www.website.com:443/idb/setPassword

The string after the t= will be different each time. How can I extract the entire URL into a variable that I can then parse to get the info I need, which is the string of characters after the ?t=?

3 Answers

Try the following:

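# -Raw reads the file as one string instead of an array of lines
$content = Get-Content -Path 'C:\test.txt' -Raw
# [^"]+ stops at the first closing quote, where a greedy .* could run on to a later one
[regex]$regex = '(?<=href="https://www\.website\.com:443/idb/setPassword\?t=)[^"]+(?=")'
$regex.Matches($content).Value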

In $content, replace the path with the path to your text file that contains the URL, and update $regex with the correct URL for your site.

This method uses a regex lookbehind (?<= ) to match the website's URL before the token and a lookahead (?= ) to match the text after it, and then selects the text in between.
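
The lookbehind idea can be tried on an inline string before pointing it at a real file. This is only a minimal sketch: the sample line and its shortened token are hypothetical, and it uses a negated character class [^"]+ so the match ends at the first closing quote.

```powershell
# Hypothetical sample line; the token value is shortened for illustration.
$sample = '<a href="https://www.website.com:443/idb/setPassword?t=AbC%2B123%3D" style="color: #fff;">Get Started</a>'

# The lookbehind anchors on the fixed URL prefix; [^"]+ takes everything up to the closing quote.
$pattern = '(?<=href="https://www\.website\.com:443/idb/setPassword\?t=)[^"]+'

$token = [regex]::Match($sample, $pattern).Value
$token                              # AbC%2B123%3D

# The token is percent-encoded; decode it if the raw value is needed.
[uri]::UnescapeDataString($token)   # AbC+123=
```

Whether the token should be decoded depends on what consumes it, so the sketch shows both forms.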

1 Comment

It seems like this always happens, but shortly after I posted this I figured out my issue: the stupid output was formatting the text document and word-wrapping the URL with spaces, so it was never matching. I ended up using the -Width 999999 parameter to make sure the URL stayed on one line. After figuring that out, I used a regex similar to the one you posted to pull that line out and match the string of characters I needed. Thanks for your answer.

Here is a solution that uses a combination of Select-String with a regular expression to get the URL and the [system.uri] class to interrogate it.

$Text = get-content 'html-sample.txt'
$URLString = ((Select-String '(http[s]?)(:\/\/)([^\s,]+)(?=")' -Input $Text).Matches.Value)

# At this point $URLString is a string with just the URL and query string, as requested
$URLString

# Here's how you might interrogate it
[system.uri]$URL = $URLString
$Token = ($URL.Query -split '=')[1]
$URL.host
$Token

Explanation:

  • Uses the regular expression (http[s]?)(:\/\/)([^\s,]+)(?=") with Select-String to extract the URL. Note that this only gets the first match by default. Use the -AllMatches switch of Select-String if you need to match multiple URLs; you will then need to handle each result in a ForEach loop.
  • Uses [system.uri] to cast the URL as a URI object.
  • Accesses the Host property of the object to return the host name.
  • Accesses the Query property of the object to return the query string (including the leading '?'), splits it on '=', and takes the second element, which is the token after '?t='.
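
If more than one URL is present, the -AllMatches route mentioned above might look like the following. This is a sketch under assumptions: the input string, the second link, and both token values are made up for illustration, and the pattern is a simplified variant that also excludes double quotes from the match. It strips the leading '?t=' with an anchored -replace as an alternative to splitting on '='.

```powershell
# Hypothetical input with two links; in practice this might come from Get-Content -Raw.
$text = '<a href="https://www.website.com:443/idb/setPassword?t=abc123" style="x">One</a> <a href="https://www.website.com:443/other?t=def456" style="y">Two</a>'

# -AllMatches returns every match in the input rather than only the first.
$found = ($text | Select-String -Pattern 'https?://[^\s,"]+' -AllMatches).Matches

$tokens = foreach ($m in $found) {
    $uri = [uri]$m.Value
    # Remove "?t=" only where it appears at the start of the query string.
    $uri.Query -replace '^\?t=', ''
}
$tokens
```

Excluding the double quote inside the character class means the closing quote is never captured, so nothing has to be stripped afterwards.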

2 Comments

Thanks Mark for your answer and explanations, I ended up using the matches parameter above but tried this out and it works as well. Thanks!
I just fixed a small bug in it, as I noticed the regex was grabbing the double quote at the end of the URL as well, and that was then being encoded as %22 on the end of the token. Added a -replace to strip out any double quotes.

Here's another way: cast to [xml] to read the file as an XmlDocument. This requires the file to be well-formed XML.

$thisxml = [xml](Get-Content .\hypertext.html)

Then drill down to the node you want using XPath.

$thisxpath = ($thisxml).SelectNodes("//table//tr//td//a").href

Then cast to [System.Uri] to parse the value and select the URI pieces you want. Note that LocalPath does not include the query string; use the Query property if you need the ?t= part.

$thisuri = [System.Uri]$thisxpath | ForEach-Object {($_.Scheme + "://" + $_.Host + $_.LocalPath)}
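
Since the question asks for the string after ?t=, the same XML approach can be pointed at the Query property instead. The following is a sketch only: the fragment and its token are hypothetical, and it assumes well-formed XML. The snippet in the question ends with a closing td that has no matching opening tag, so it would likely need cleaning up first.

```powershell
# Hypothetical well-formed fragment; the [xml] cast throws on markup that is not valid XML.
$fragment = @'
<table><tbody><tr><td>
<a href="https://www.website.com:443/idb/setPassword?t=abc123">Get Started</a>
</td></tr></tbody></table>
'@

$doc  = [xml]$fragment
$href = $doc.SelectSingleNode('//a').href

# Keep the query string this time and drop the leading "?t=".
$token = ([uri]$href).Query -replace '^\?t=', ''
$token
```

For HTML that is not valid XML, the regex-based answers are probably the more forgiving route.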
