Extracting URL from a string

Question

I'm just starting with regular expressions in Python and came across a problem where I'm supposed to extract the URLs from this string:

str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

The code I have is:

import re

url = re.findall('<tag>(.*)</tag>', str)

print(url)

returns:

['http://example-1.com</tag><tag>http://example-2.com']

If anyone could point me in the right direction on how to approach this problem, I would be most appreciative!

Thanks everyone!

Use the non-greedy .*? instead of the greedy .*, or use [^>]* instead of .*, or, best of all, use an HTML parser.
– Pushpesh Kumar Rajwanshi, Commented Apr 1, 2019 at 10:29
Oh wow, thanks! That worked perfectly! I'll go read up on greedy and non-greedy matching a bit more. I did consider a parser, but I wanted to try it with RE since it was a question under that topic. Thank you so much!
– Cuppy, Commented Apr 1, 2019 at 10:30

Ion Batîr · Accepted Answer · 2019-04-01 10:31:30Z

2

You are using a regular expression, and matching HTML with regular expressions gets too complicated, too fast.

You can use BeautifulSoup to parse HTML.

For example:

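from bs4 import BeautifulSoup

# Named html rather than str to avoid shadowing the built-in str type
html = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('tag')
for tag in tags:
    print(tag.text)  # text content of each <tag> element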

edited Apr 1, 2019 at 10:31

answered Apr 1, 2019 at 10:20

Ion Batîr


1 Comment

Cuppy Over a year ago

Thank you very much! I am going through an online course and wanted to keep it within the topic of RE so I decided to try it with just the RE library (which is still a mystery to me..!)

JLD · Accepted Answer · 2019-04-01 10:35:23Z

1

Using only re package:

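import re
# Named text rather than str to avoid shadowing the built-in str type
text = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
url = re.findall(r'<tag>(.*?)</tag>', text)  # .*? is the non-greedy form
print(url)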

returns:

['http://example-1.com', 'http://example-2.com']

Hope it helps!
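
A comment on the question also suggests a negated character class in place of .*. Here is a minimal sketch of that idea, assuming the tag contents never contain a '<' character. It uses [^<]* rather than the [^>]* from the comment; both give the same result on this input. The variable name text is just an illustrative choice.

```python
import re

text = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

# [^<]* matches any run of characters that are not '<', so the capture
# group cannot run past the start of the closing </tag>.
urls = re.findall(r"<tag>([^<]*)</tag>", text)
print(urls)
```

Because the character class can never consume a '<', no backtracking over later tags is needed, which is one reason this form is sometimes preferred over .*? for simple tag-like input.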

answered Apr 1, 2019 at 10:35

JLD


3 Comments

Cuppy Over a year ago

Thanks! That worked! I'm now trying to figure out greedy and non-greedy matching and why it works in this instance.

guroosh Over a year ago

Here .*? matches as few characters as possible (lazy), i.e. it stops at the first closing </tag>, whereas the greedy .* matches as many characters as possible, so it runs on to the last closing </tag> and the whole string becomes a single match.

Cuppy Over a year ago

Oh wow, your explanation made so much more sense than all the tutorials I have been reading/watching! Thanks!
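
The greedy versus lazy behavior described in the comments above can be checked directly by running both patterns on the string from the question. This is only a small sketch; the names text, greedy and lazy are illustrative.

```python
import re

text = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

# Greedy: .* grabs as much as it can, so the capture group extends
# to the last </tag> in the string and only one match is found.
greedy = re.findall(r"<tag>(.*)</tag>", text)

# Lazy: .*? grabs as little as it can, so each capture group ends
# at the nearest </tag> and two matches are found.
lazy = re.findall(r"<tag>(.*?)</tag>", text)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The greedy pattern yields a single element that still contains the inner </tag><tag> markup, while the lazy pattern yields the two URLs separately.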
