0

I'm just starting to learn regular expressions in Python and came across this problem, where I'm supposed to extract the URLs from the string:

str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

The code I have is:

import re

url = re.findall('<tag>(.*)</tag>', str)

print(url)

returns:

['http://example-1.com</tag><tag>http://example-2.com']

If anyone could point me in the right direction on how to approach this problem, it would be much appreciated!

Thanks everyone!

2
  • Use the non-greedy .*? instead of the greedy .*, or use [^>]* instead of .*, or, best of all, use an HTML parser. Commented Apr 1, 2019 at 10:29
  • 1
    Oh wow thanks! That worked perfectly! I'll go read up on greedy and non greedy ones a bit more! I did consider a parser but I wanted to try it in RE since it was a question under that topic. Thank you so much! Commented Apr 1, 2019 at 10:30
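
The negated character class suggested in the comment above can be sketched as follows. This is only a minimal illustration, assuming the same input string as in the question; the variable name `text` is my own choice (it avoids shadowing the built-in `str`).

```python
import re

# Same input as in the question; `text` is a hypothetical name used
# here so that the built-in `str` is not shadowed.
text = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

# [^>]* matches any run of characters that are not '>', so the capture
# group cannot run past the '>' that ends the first closing tag.
urls = re.findall(r"<tag>([^>]*)</tag>", text)

print(urls)
```

A variation is `[^<]*`, which stops at the `<` that opens the closing tag; for this input it should give the same list.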

2 Answers

2

You are using a regular expression, and matching HTML with regular expressions gets too complicated, too fast.

You can use BeautifulSoup to parse HTML.

For example:

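from bs4 import BeautifulSoup

str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
# Parse the markup with Python's built-in HTML parser
soup = BeautifulSoup(str, 'html.parser')
# Collect every <tag> element and print the text it contains
tags = soup.find_all('tag')
for tag in tags:
    print(tag.text)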


1 Comment

Thank you very much! I am going through an online course and wanted to keep it within the topic of RE so I decided to try it with just the RE library (which is still a mystery to me..!)
1

Using only the re package:

import re
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
url = re.findall('<tag>(.*?)</tag>', str)
print(url)

returns:

['http://example-1.com', 'http://example-2.com']

Hope it helps!

3 Comments

Thanks! That worked! I'm now trying to figure out greedy and non-greedy matching and why it works in this instance.
Here .*? matches as few characters as possible (lazy), so it stops at the first closing </tag>. The greedy .* matches as many characters as possible, so it runs on to the last closing </tag> and returns everything in between as a single match.
Oh wow, your explanation made so much more sense than all the tutorials I have been reading/watching! Thanks!
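
To see the difference described in the comments above, one can compare where each pattern's first match ends. This is just a small sketch using the string from the question; the names `text`, `greedy` and `lazy` are illustrative and not part of the original answer.

```python
import re

# Same input as in the question; `text` avoids shadowing the built-in `str`.
text = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

# Greedy: .* takes as much as it can, then backs off to the LAST </tag>.
greedy = re.search(r"<tag>(.*)</tag>", text)

# Lazy: .*? takes as little as it can, so it stops at the FIRST </tag>.
lazy = re.search(r"<tag>(.*?)</tag>", text)

print(greedy.group(1))  # runs across both tags
print(lazy.group(1))    # only the first URL

# The greedy match reaches the end of the string; the lazy one does not.
print(greedy.end() == len(text), lazy.end() < len(text))
```

re.findall keeps searching from the point where the previous match ended, which is why the lazy pattern finds both URLs separately while the greedy pattern finds one long match.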
