I'm using a JavaScript regex to do the following:

I have the HTML content of a page saved in a string, and I want to match all URLs on the page.

For example, if the document contains:

<script src = "http://www.a.com">
<a href="http://www.b.com">
<a href= "http://www.c.com">
<a href ="http://www.d.com">

I want the matches to be:

http://www.a.com
http://www.b.com
http://www.c.com
http://www.d.com

Any help would be appreciated, thanks!

  • Are your URLs really that simple, or will they contain parameters or longer paths? Commented Jan 10, 2011 at 1:53
  • /me facepalms stackoverflow.com/questions/1732348/… Commented Jan 10, 2011 at 2:40
  • @Hello71 I have done as you have asked, I have parsed the HTML with HTML5 Lib, I have fetched all the links, I have fixed all the encoding bugs, all the unknown unsupported unicode symbols and finally after weeks of work got those links from that html. Was it worth it? Maybe. Is the added complexity worth it? No it is not, parsing HTML is a lot harder than you think, HTML can contain other types of content and is extremely complicated, regex matching links might actually be the better answer here... that or a custom parser (which I also tried, great for really long texts). Commented Aug 5, 2014 at 9:37

2 Answers

2

John Gruber has an excellent regex for URLs over at his site, Daring Fireball: http://daringfireball.net/2010/07/improved_regex_for_matching_urls

You can implement it like so:

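function regex(url) {
    // JavaScript has no inline (?i) modifier, so case-insensitivity is set with the /i flag,
    // and the literal slashes in the pattern are escaped as \/.
    var regex = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;
    return regex.test(url);
}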
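
The function above only reports whether a string contains a URL. To collect every link from the HTML string in the question, one option is a global regex over the `src` and `href` attributes. This is a minimal sketch, not part of Gruber's pattern: the name `extractLinkUrls` and the variable `page` are made up for illustration, and it assumes attribute values are quoted. It will miss unquoted attributes and can be fooled by markup inside comments or scripts.

```javascript
// Hypothetical helper: collects the quoted values of src and href attributes.
function extractLinkUrls(html) {
    // \s* on both sides of "=" tolerates spacing such as href= "..." or src = "..."
    // (["']) captures the opening quote and \1 requires the same closing quote.
    var attrPattern = /\b(?:href|src)\s*=\s*(["'])(.*?)\1/gi;
    var urls = [];
    var match;
    // With the g flag, exec() resumes from lastIndex on each call.
    while ((match = attrPattern.exec(html)) !== null) {
        urls.push(match[2]);
    }
    return urls;
}

// The sample markup from the question.
var page = [
    '<script src = "http://www.a.com">',
    '<a href="http://www.b.com">',
    '<a href= "http://www.c.com">',
    '<a href ="http://www.d.com">'
].join('\n');

console.log(extractLinkUrls(page));
```

Because it keys on the attribute rather than on the URL shape, this returns whatever the attribute holds, including relative paths.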

2 Comments

I get an error of misplaced | with that code - this gist works well: gist.github.com/1033143. It uses the same regex.
This matches URLs, not links/anchors, which is not exactly the same thing.
0
function isUrl(url) {
    // Returns true if the string contains an http(s) URL; the pattern is not anchored.
    var regexp = /(http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/;
    return regexp.test(url);
}

It is a bit more generic, but you may modify it for your needs.
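
One possible modification, sketched here under stated assumptions: to pull every URL out of an HTML string instead of testing a single value, add the `g` flag and stop the match at characters that cannot be part of the URL in markup. The `\S+` in `isUrl` would run past the closing quote, so this sketch uses a narrower character class. The name `findUrls` and the variable `sample` are hypothetical.

```javascript
// Hypothetical helper: returns every http(s) URL found in a block of text.
function findUrls(text) {
    // Same scheme prefix as isUrl, but the body stops at whitespace,
    // quotes, and angle brackets so it does not swallow the rest of the tag.
    var pattern = /https?:\/\/[^\s"'<>]+/g;
    // match() with the g flag returns all matches, or null when there are none.
    return text.match(pattern) || [];
}

var sample = '<script src = "http://www.a.com">\n<a href="http://www.b.com">';
console.log(findUrls(sample));
```

Unlike an attribute-based pattern, this also picks up URLs that appear in plain text, which may or may not be what you want.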
