14

I know this has been asked a thousand times before (apologies), but after searching SO, Google, etc. I have yet to find a conclusive answer.

Basically, I need a JS function that, when passed a string, identifies and extracts all URLs based on a regex and returns an array of everything found, e.g.:

function findUrls(searchText){
    var regex=???; // the pattern I am looking for
    var result = searchText.match(regex); // declared locally to avoid an implicit global
    return result ? result : false;
}

The function should be able to detect and return any potential URLs. I am aware of the inherent difficulties/issues with this (closing parentheses etc.), so I have a feeling the process needs to be:

Split the string (searchText) into distinct sections, each bounded on either side by nothing, a space or a carriage return, resulting in distinct content chunks, i.e. do a split.

For each content chunk that results from the split, see whether it fits the logic for a URL of any construction, namely: does it contain a period immediately followed by text (the one constant rule for qualifying a potential URL)?

The regex should see whether the period is immediately followed by other text, of the type allowable for a tld, directory structure & query string, and preceded by text of the allowable type for a URL.
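The split-then-test process described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name findUrlCandidates, the punctuation-stripping step and the "URL-like" pattern are all hypothetical, and the pattern deliberately ignores ports, user info and non-ASCII (IDN) hosts.

```javascript
// Hypothetical helper (not from the question): split on whitespace, then test each chunk.
function findUrlCandidates(searchText) {
    // Assumed "URL-like" rule: optional scheme, dot-separated labels,
    // then an optional path, query string or fragment.
    var candidate = /^(?:[a-z][a-z0-9+.\-]*:\/\/)?[a-z0-9\-]+(?:\.[a-z0-9\-]+)+(?:[\/?#]\S*)?$/i;
    var chunks = searchText.split(/\s+/);
    var found = [];
    for (var i = 0; i < chunks.length; i++) {
        // Strip punctuation that commonly wraps or trails a URL in prose.
        // Caveat: this also drops a legitimate trailing ")" or "?".
        var chunk = chunks[i]
            .replace(/^[("'<\[]+/, "")
            .replace(/[)"'>\].,;:!?]+$/, "");
        if (candidate.test(chunk)) {
            found.push(chunk);
        }
    }
    return found;
}

var sample = "Visit http://www.google.com, google.com or (www.google.com/search?q=test). "
    + "Ask will.i.am about ftp.google.com today.";
console.log(findUrlCandidates(sample));
```

Unlike the function in the question, this sketch returns an empty array rather than false when nothing is found, and it accepts will.i.am as a candidate, which matches the "false positives are fine" requirement.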

I am aware false positives may result; however, any returned values will then be checked with a call to the URL itself, so this can be ignored. The other functions I have found often don't return the URL's query string, if present.

From a block of text, the function should thus be able to return any type of URL, even if it means identifying will.i.am as a valid one!

e.g. http://www.google.com, google.com, www.google.com, http://google.com, ftp.google.com, https:// etc... and any derivation thereof with a query string should be returned...

Many thanks, and apologies again if this exists elsewhere on SO, but my searches haven't returned it.

4
  • Possible duplicate: stackoverflow.com/questions/1986121/… Commented Jun 26, 2012 at 14:02
  • 1
    People should stop prefixing JS variables with $... JS is not PHP! Commented Jun 26, 2012 at 14:42
  • Sorry, had my head in PHP all day, will remove! Commented Jun 26, 2012 at 14:59
  • Re: the possible duplicate, the regex in the listed question doesn't meet all the criteria I set out. Commented Jun 27, 2012 at 7:53

5 Answers

27

I just use URI.js -- makes it easy.

var source = "Hello www.example.com,\n"
    + "http://google.com is a search engine, like http://www.bing.com\n"
    + "http://exämple.org/foo.html?baz=la#bumm is an IDN URL,\n"
    + "http://123.123.123.123/foo.html is IPv4 and "
    + "http://fe80:0000:0000:0000:0204:61ff:fe9d:f156/foobar.html is IPv6.\n"
    + "links can also be in parens (http://example.org) "
    + "or quotes »http://example.org«.";

var result = URI.withinString(source, function(url) {
    return "<a>" + url + "</a>";
});

/* result is:
Hello <a>www.example.com</a>,
<a>http://google.com</a> is a search engine, like <a>http://www.bing.com</a>
<a>http://exämple.org/foo.html?baz=la#bumm</a> is an IDN URL,
<a>http://123.123.123.123/foo.html</a> is IPv4 and <a>http://fe80:0000:0000:0000:0204:61ff:fe9d:f156/foobar.html</a> is IPv6.
links can also be in parens (<a>http://example.org</a>) or quotes »<a>http://example.org</a>«.
*/

16

You could use the regex from URI.js:

// gruber revised expression - http://rodneyrehm.de/t/url-regex.html
var uri_pattern = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/ig;

String#match and/or String#replace may help…
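As a rough illustration of how String#match and String#replace differ when given a global pattern, here is a small sketch. The pattern simplePattern is a deliberately simplified stand-in chosen for this example (an assumption, not the URI.js expression), and the sample text is made up.

```javascript
// Stand-in pattern for illustration only; far simpler than the URI.js expression.
var simplePattern = /\bhttps?:\/\/[^\s()<>]+/gi;
var text = "See http://example.org/a?b=c and https://example.com/x#y for details";

// String#match with the g flag returns every match, or null when there is none.
var urls = text.match(simplePattern) || [];

// String#replace can rewrite each match in place via a callback.
var linked = text.replace(simplePattern, function (url) {
    return "<a>" + url + "</a>";
});

console.log(urls);
console.log(linked);
```

match suits the "return an array" requirement from the question, while replace is the better fit when the goal is to turn the URLs into links.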

2 Comments

Note that using a regex - this one in particular - can cause problems ("catastrophic backtracking") - github.com/medialize/URI.js/issues/131 - I'd go with @chovy's answer and use URI.withinString()
The regex in this answer is vulnerable to ReDoS from strings such as "https://stackoverflow.com/questions/11209016/javascript-extract-urls-from-string-inc-querystring-and-return-array/11209098#11209098"

3

The following regular expression extracts URLs (including the query string) from a string and returns them as an array:

var url = "asdasdla hakjsdh aaskjdh https://www.google.com/search?q=add+a+element+to+dom+tree&oq=add+a+element+to+dom+tree&aqs=chrome..69i57.7462j1j1&sourceid=chrome&ie=UTF-8 askndajk nakjsdn aksjdnakjsdnkjsn";

var matches = url.match(/\bhttps?::\/\/\S+/gi) || url.match(/\bhttps?:\/\/\S+/gi);

Output:

["https://www.google.com/search?q=format+to+6+digir&…s=chrome..69i57.5983j1j1&sourceid=chrome&ie=UTF-8"]

Note: This handles both http:// with a single colon and the malformed http::// with a double colon, and likewise for https. Because of the ||, the single-colon pattern is only tried when no double-colon URL is found in the string.
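If a string may contain both forms at once, one possible variation (an assumption on my part, not part of the answer above) is to make the second colon optional so that a single pass catches everything. The sample text and variable names below are made up for illustration.

```javascript
var mixedText = "a http://one.example/x?y=1 b https:://two.example/z c";

// Two-step approach: the second match is skipped as soon as the first finds anything.
var twoStep = mixedText.match(/\bhttps?::\/\/\S+/gi) || mixedText.match(/\bhttps?:\/\/\S+/gi);

// Single pattern with an optional second colon; falls back to an empty array.
var onePass = mixedText.match(/\bhttps?::?\/\/\S+/gi) || [];

console.log(twoStep); // only the double-colon URL
console.log(onePass); // both URLs
```

Note that \S+ runs to the next whitespace character, so trailing punctuation such as a comma or closing parenthesis ends up inside the match.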

1

Try this:

var expression = /[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi;

You could use this website to test the regexp: http://gskinner.com/RegExr/
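A quick way to see what this expression picks up is to run it through String#match. The sample strings and variable names below are assumptions made up for this sketch; the expression itself is copied unchanged from above.

```javascript
var expression = /[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi;

// Bare domains and full URLs with a query string are both picked up.
var sampleText = "google.com and http://www.google.com/search?q=test end";
var foundUrls = sampleText.match(expression);
console.log(foundUrls); // ["google.com", "http://www.google.com/search?q=test"]

// The {2,4} limit on the last label means longer TLDs are not matched.
var longTld = "example.museum".match(expression);
console.log(longTld); // null
```

If longer top-level domains matter, the upper bound in [a-z]{2,4} would need to be raised.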

1

In UiPath Studio the following built-in regex rule has been defined:

/(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-a-zA-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-a-zA-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-a-zA-Z0-9+&@#\/%=~_|$?!:,.]*\)|[a-zA-Z0-9+&@#\/%=~_|$])/
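To collect every match into an array in JavaScript, the rule needs the g flag, which is not part of the rule as quoted. The sketch below adds it; the variable names and sample text are assumptions for illustration only.

```javascript
// Same rule as above with the g flag added so String#match returns all matches.
var uiPathRule = /(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-a-zA-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-a-zA-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-a-zA-Z0-9+&@#\/%=~_|$?!:,.]*\)|[a-zA-Z0-9+&@#\/%=~_|$])/g;

// Trailing punctuation such as "," and "." is left out of the match.
var prose = "See www.google.com, or http://example.org/a?b=1.";
var ruleMatches = prose.match(uiPathRule) || [];
console.log(ruleMatches); // ["www.google.com", "http://example.org/a?b=1"]

// A bare domain has no scheme, "www." or "ftp." prefix, so it is not matched.
var bareDomain = "google.com".match(uiPathRule);
console.log(bareDomain); // null
```

Because the rule requires a scheme or a www./ftp. prefix, it yields fewer false positives than the looser patterns above, but it will not return plain google.com as the question asks.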
