Python regular expression to get URL

Question

I am trying to get a URL out of a long string and I am unsure how to write the regex:

$ string = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'

I am trying to use the 're.search' function to pull out only WWW.WEBSITE.COM, without the surrounding spaces. I would like it to look like this:

$ get_site = re.search(regex).group()

$ print get_site

$ WWW.WEBSITE.COM

try this WWW\.(.*)\.COM. I am no expert in regex that's why I am commenting. Probably someone else can provide a better non-greedy one ? — Vipul
– Vipul, Commented Jun 11, 2014 at 20:41
Your question is pretty vague. Why shouldn't it return the IP address? It's just as valid as a URL. — Kendall Frey
– Kendall Frey, Commented Jun 11, 2014 at 20:42
hi Vipul, thanks for your input. I actually need a more robust method for this. Not all of my sites will start with WWW or anything like that, BUT they will all be in between a (-) and the (GET)!! — MorganTN
– MorganTN, Commented Jun 11, 2014 at 20:42

score 7 · Accepted Answer · 2014-06-11 20:56:21Z

BUT they will all be in between a (-) and the (GET)

That is all the information you need:

>>> import re
>>> string = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'
>>> # Raw string (r'...') so that \s reaches the regex engine unchanged
>>> re.search(r'-\s+(.+?)\s+GET', string).group(1)
'WWW.WEBSITE.COM'

Below is a breakdown of what the Regex pattern is matching:

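-      # A literal hyphen
\s+    # One or more whitespace characters
(.+?)  # A capture group for one or more characters, matched non-greedily (as few as possible)
\s+    # One or more whitespace characters
GET    # The literal text GET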
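The same breakdown can live next to the pattern itself with Python's re.VERBOSE flag, which ignores whitespace inside the pattern and allows # comments. A minimal sketch, assuming the sample string from the question; the names pattern and line are only illustrative:

```python
import re

# re.VERBOSE lets the pattern span several lines with inline comments.
pattern = re.compile(r'''
    -      # a literal hyphen
    \s+    # one or more whitespace characters
    (.+?)  # capture group: one or more characters, as few as possible
    \s+    # one or more whitespace characters
    GET    # the literal text GET
''', re.VERBOSE)

line = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'
print(pattern.search(line).group(1))  # WWW.WEBSITE.COM
```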

Note too that .group(1) returns only the text captured by (.+?), whereas .group() returns the entire match:

>>> re.search(r'-\s+(.+?)\s+GET', string).group()
'- WWW.WEBSITE.COM GET'
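One thing to watch for: re.search returns None when a line does not match, so calling .group() directly would raise an AttributeError. A small sketch of one way to guard against that; the helper name extract_host and the second test line are made up for illustration:

```python
import re

# Same pattern as above, compiled once so it can be reused for many lines.
HOST_PATTERN = re.compile(r'-\s+(.+?)\s+GET')

def extract_host(line):
    """Return the text between '-' and 'GET', or None if the line does not match."""
    match = HOST_PATTERN.search(line)
    return match.group(1) if match else None

print(extract_host('192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'))  # WWW.WEBSITE.COM
print(extract_host('no separator here'))  # None
```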

edited Jun 11, 2014 at 20:56

answered Jun 11, 2014 at 20:42

user2555451

Reloader · Answer · 2014-06-11 20:50:03Z

0

WWW\.(.+)\.[A-Z]{2,3}

WWW        #WWW
\.         #dot
(.+)       #one or more arbitrary characters
\.         #dot, again
[A-Z]{2,3} #two or three uppercase letters (some TLDs, such as .eu, have only two)

answered Jun 11, 2014 at 20:50

Reloader

1 Comment

Leopold Asperger Over a year ago

That regex won't cover subdomains, IP addresses, or TLDs longer than three characters, such as .info.
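To make the comment's point concrete, here is a short sketch that runs the pattern from this answer against a few made-up log lines in the same shape as the one in the question. The helper first_match is hypothetical:

```python
import re

# Pattern from the answer above, unchanged.
pattern = re.compile(r'WWW\.(.+)\.[A-Z]{2,3}')

def first_match(line):
    """Hypothetical helper: return the matched text, or None when nothing matches."""
    m = pattern.search(line)
    return m.group() if m else None

# Made-up log lines modelled on the one in the question.
print(first_match('192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'))   # WWW.WEBSITE.COM
print(first_match('192.00.00.00 - MAIL.WEBSITE.COM GET /random/url/link'))  # None
print(first_match('192.00.00.00 - WWW.WEBSITE.INFO GET /random/url/link'))  # WWW.WEBSITE.INF
```

The second line shows that a host without the WWW prefix is missed, and the third that a four-letter TLD is cut off after three letters.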

Leopold Asperger · Answer · 2014-06-12 12:27:42Z

I wrote the following regex a while ago for a PHP project. It's based on the relevant RFC, so it should cover any valid URL. I remember testing it extensively too, so it should be reliable.

const re_host = '(([a-z0-9-]+\.)+[a-z]+|([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])){3})';
const re_port = '(:[0-9]+)?';
const re_path = '([a-z0-9-\._\~\(\)]|%[0-9a-f]{2})+';
const re_query = '(\?(([a-z0-9-\._\~!\$&\'\(\)\*\+,;=:@/\?]|%[0-9a-f]{2})*)?)?';
const re_frag = '(#(([a-z0-9-\._\~!\$&\'\(\)\*\+,;=:@/\?]|%[0-9a-f]{2})*)?)?';
const re_localpart = '[a-z0-9!#\$%&\'*\+-/=\?\^_`{|}\~\.]+';
const re_GraphicFileExts = '\.(png|gif|jpg|jpeg)';

$this->re_href = '~^'.'('.'https?://'.self::re_host.self::re_port.'|)'.'((/'.self::re_path.')*|/?)'.'/?'.self::re_query.self::re_frag.'$~i';
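Since the question is about Python, here is a rough sketch of how just the host part (re_host) might be carried over. This is an assumption-laden port, not the author's code: only re_host is translated, the names OCTET, RE_HOST, HOST and is_host are made up, and re.IGNORECASE stands in for the ~i modifier used in the PHP pattern.

```python
import re

# Octet and host patterns copied from re_host in the PHP constants above.
OCTET = r'([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])'
RE_HOST = r'(([a-z0-9-]+\.)+[a-z]+|' + OCTET + r'(\.' + OCTET + r'){3})'

# The PHP pattern is case-insensitive (~i), so the same flag is used here.
HOST = re.compile(RE_HOST, re.IGNORECASE)

def is_host(text):
    """Hypothetical helper: True if the whole string matches RE_HOST."""
    return HOST.fullmatch(text) is not None

print(is_host('WWW.WEBSITE.COM'))  # True
print(is_host('192.168.0.1'))      # True
print(is_host('192.00.00.00'))     # False
```

Note that the placeholder address from the question, 192.00.00.00, is rejected because the octet pattern does not allow "00".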

Avinash Raj · Answer · 2014-06-12 13:34:11Z

0

You could use this regex also.

>>> import re
>>> string = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'
>>> match = re.search(r'-\s+([^ ]+)\s+GET', string)
>>> match.group(1)
'WWW.WEBSITE.COM'

Breakdown of regex:

-        # a literal -
\s+      # one or more whitespace characters
([^ ]+)  # one or more characters that are not a space; the parentheses store them in a capture group
\s+      # one or more whitespace characters
GET      # the literal string GET must follow all of the above
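As a rough illustration of how this pattern differs from the (.+?) version in the accepted answer: both return the same result for the sample string, but they behave differently when extra text sits between the host and GET. The helper capture and the second log line are made up for this sketch:

```python
import re

lazy = re.compile(r'-\s+(.+?)\s+GET')      # pattern from the accepted answer
strict = re.compile(r'-\s+([^ ]+)\s+GET')  # pattern from this answer

def capture(pattern, line):
    """Hypothetical helper: return group 1, or None when the line does not match."""
    m = pattern.search(line)
    return m.group(1) if m else None

normal = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'
odd = '192.00.00.00 - WWW.WEBSITE.COM extra GET /random/url/link'  # made-up line

print(capture(lazy, normal), capture(strict, normal))  # WWW.WEBSITE.COM WWW.WEBSITE.COM
print(capture(lazy, odd))    # WWW.WEBSITE.COM extra
print(capture(strict, odd))  # None
```

Which behavior is preferable depends on whether a space inside the captured text should count as part of the value or as a sign of a malformed line.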

answered Jun 12, 2014 at 13:34

Avinash Raj
