
I am trying to extract a URL from a long string, and I am unsure how to write the regex:

$ string = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'

I am trying to use the 're.search' function to pull out only WWW.WEBSITE.COM, without the surrounding spaces. I would like it to look like this:

$ get_site = re.search(regex, string).group()

$ print get_site

$ WWW.WEBSITE.COM
  • try this WWW\.(.*)\.COM. I am no expert in regex that's why I am commenting. Probably someone else can provide a better non-greedy one ? Commented Jun 11, 2014 at 20:41
  • Your question is pretty vague. Why shouldn't it return the IP address? It's just as valid as a URL. Commented Jun 11, 2014 at 20:42
  • hi Vipul, thanks for your input. I actually need a more robust method for this. Not all of my sites will start with WWW or anything like that, BUT they will all be in between a (-) and the (GET)!! Commented Jun 11, 2014 at 20:42
  • ip's in my dataset are for the individual user Commented Jun 11, 2014 at 20:43
  • Maybe this helps: stackoverflow.com/questions/6038061/… Commented Jun 11, 2014 at 20:44

4 Answers

BUT they will all be in between a (-) and the (GET)

That is all the information you need:

>>> import re
>>> string = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'
>>> re.search(r'-\s+(.+?)\s+GET', string).group(1)
'WWW.WEBSITE.COM'
>>>

Below is a breakdown of what the Regex pattern is matching:

-      # A literal hyphen
\s+    # One or more whitespace characters
(.+?)  # A capture group for one or more characters, matched non-greedily
\s+    # One or more whitespace characters
GET    # The literal text GET

Note too that .group(1) gets the text captured by (.+?). .group() would return the entire match:

>>> re.search(r'-\s+(.+?)\s+GET', string).group()
'- WWW.WEBSITE.COM GET'
>>>
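
Note that re.search returns None when nothing matches, so calling .group(1) directly raises AttributeError on a line that lacks the "- ... GET" structure. A minimal sketch of a guarded version follows; the helper name extract_host and the choice to return None are assumptions for illustration, not part of the original answer.

```python
import re

# Hypothetical helper: the name and the None fallback are illustrative assumptions.
HOST_PATTERN = re.compile(r'-\s+(.+?)\s+GET')

def extract_host(line):
    """Return the text between '-' and 'GET', or None if the line does not match."""
    match = HOST_PATTERN.search(line)
    return match.group(1) if match else None

print(extract_host('192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'))  # WWW.WEBSITE.COM
print(extract_host('no host here'))                                         # None
```

Compiling the pattern once is optional; it mainly helps readability when the same pattern is applied to many log lines.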

WWW\.(.+)\.[A-Z]{2,3}

WWW        #WWW
\.         #dot
(.+)       #one or more arbitrary characters
\.         #dot, again
[A-Z]{2,3} #two or three uppercase letters (some TLDs, such as .eu, have only two)

1 Comment

That regex won't cover subdomains, IP addresses, or TLDs longer than three characters, such as .info.
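
A quick way to see some of these limits is to run the pattern against a few lines. This is only a sketch: the MAIL.WEBSITE.COM and WWW.WEBSITE.INFO hosts are made-up test inputs, and first_match is a hypothetical helper.

```python
import re

# Pattern from the answer above; the sample hosts are made up for illustration.
pattern = re.compile(r'WWW\.(.+)\.[A-Z]{2,3}')

def first_match(line):
    """Return the full matched text, or None when the pattern does not match."""
    match = pattern.search(line)
    return match.group() if match else None

print(first_match('192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'))   # WWW.WEBSITE.COM
print(first_match('192.00.00.00 - MAIL.WEBSITE.COM GET /random/url/link'))  # None
print(first_match('192.00.00.00 - WWW.WEBSITE.INFO GET /random/url/link'))  # WWW.WEBSITE.INF
```

The second line fails because the host has no WWW prefix, and the third is silently truncated because {2,3} stops after three letters.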

I wrote the following regex a while ago for a PHP project. It is based on the dedicated RFC, so it will cover any valid URL. I remember testing it extensively too, so it should be reliable.

const re_host = '(([a-z0-9-]+\.)+[a-z]+|([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])){3})';
const re_port = '(:[0-9]+)?';
const re_path = '([a-z0-9-\._\~\(\)]|%[0-9a-f]{2})+';
const re_query = '(\?(([a-z0-9-\._\~!\$&\'\(\)\*\+,;=:@/\?]|%[0-9a-f]{2})*)?)?';
const re_frag = '(#(([a-z0-9-\._\~!\$&\'\(\)\*\+,;=:@/\?]|%[0-9a-f]{2})*)?)?';
const re_localpart = '[a-z0-9!#\$%&\'*\+-/=\?\^_`{|}\~\.]+';
const re_GraphicFileExts = '\.(png|gif|jpg|jpeg)';

$this->re_href = '~^'.'('.'https?://'.self::re_host.self::re_port.'|)'.'((/'.self::re_path.')*|/?)'.'/?'.self::re_query.self::re_frag.'$~i';


You could also use this regex:

>>> import re
>>> string = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'
>>> match = re.search(r'-\s+([^ ]+)\s+GET', string)
>>> match.group(1)
'WWW.WEBSITE.COM'

Breakdown of regex:

-        # a literal -
\s+      # one or more whitespace characters
([^ ]+)  # one or more characters that are not a space; the parentheses capture them as group 1
\s+      # one or more whitespace characters
GET      # the literal text GET, which must follow everything above
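
The practical difference from the (.+?) version shows up when the text between the hyphen and GET contains a space. The sketch below uses a made-up malformed line to compare the two; the variable names are illustrative only.

```python
import re

lazy = re.compile(r'-\s+(.+?)\s+GET')      # any characters, as few as possible
strict = re.compile(r'-\s+([^ ]+)\s+GET')  # a single run of non-space characters

# Hypothetical malformed log line with a space inside the host field.
line = '192.00.00.00 - FOO BAR GET /random/url/link'

print(lazy.search(line).group(1))  # FOO BAR
print(strict.search(line))         # None
```

Which behavior is preferable depends on the data: the stricter pattern rejects lines whose host field contains a space, while the lazy one accepts them.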
