
I'm trying to look at an HTML file and remove all the tags from it so that only the text is left, but I'm having a problem with my regex. This is what I have so far:

import urllib.request, re

def test(url):
    html = str(urllib.request.urlopen(url).read())
    print(re.findall('<[\w\/\.\w]*>', html))

The HTML is a simple page with a few links and text, but my regex won't pick up the <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> declaration or <a href="...."> tags. Can anyone explain what I need to change in my regex?
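
For reference, a minimal sketch of why the pattern misses those tags: the class [\w\/\.\w] only matches word characters, slashes, and dots, so any tag containing a space, quote, hyphen, or exclamation point can never match. A catch-all class like <[^>]*> picks them up, though it is still fragile if a ">" ever appears inside an attribute value (the sample HTML below is made up):

import re

html = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
        '<a href="http://example.com">link</a> text')

# The original class lacks space, quote, '!', and '-', so the DOCTYPE
# and any tag carrying attributes never match:
print(re.findall(r'<[\w\/\.\w]*>', html))   # ['</a>']

# Matching any run of non-'>' characters between angle brackets
# catches them all:
print(re.findall(r'<[^>]*>', html))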

  • Problems parsing HTML with regex, you say? Why, I can scarcely believe it! Who would have thought! What a turn up for the books! PS. BeautifulSoup. Commented Jan 29, 2010 at 23:29
  • Stay calm, bobince. Breathe slowly into the paper bag. In, out, in out, ... stackoverflow.com/questions/1732348/… Commented Jan 30, 2010 at 0:02
  • I love the regularity with which these questions appear. It's like the "Find Similar Questions" part of the new question form doesn't work :D Commented Jan 30, 2010 at 0:31
  • If you spend a little time on SO you'll find that there are about infinity billion better ways to parse HTML and regex is not one of them. Commented Jan 30, 2010 at 0:34
  • Yes, if you are dealing with the vanishingly-small subset of HTML documents in the universe whose formatting you have perfect knowledge of. Commented Jan 30, 2010 at 10:35

2 Answers


Use BeautifulSoup. Use lxml. Do not use regular expressions to parse HTML.
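
For comparison, here is a minimal BeautifulSoup sketch (assuming the bs4 and requests packages are installed; the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

url = "http://example.com"  # placeholder
html = requests.get(url).text

soup = BeautifulSoup(html, "html.parser")
# Drop script and style elements so their contents don't leak into
# the extracted text.
for tag in soup(["script", "style"]):
    tag.decompose()

print(soup.get_text(separator=" ", strip=True))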


Edit 2010-01-29: This would be a reasonable starting point for lxml:

from lxml.html import fromstring
from lxml.html.clean import Cleaner
import requests

url = "https://stackoverflow.com/questions/2165943/removing-html-tags-from-a-text-using-regular-expression-in-python"
html = requests.get(url).text

doc = fromstring(html)

# Tags removed entirely; their text content is pulled up into the parent.
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6',
        'div', 'span',
        'img', 'area', 'map']
# The True flags strip that class of content (scripts, styles, <link> tags);
# the False flags leave meta tags, unsafe attributes, and page structure alone.
args = {'meta': False, 'safe_attrs_only': False, 'page_structure': False,
        'scripts': True, 'style': True, 'links': True, 'remove_tags': tags}
cleaner = Cleaner(**args)

path = '/html/body'
body = doc.xpath(path)[0]

text = cleaner.clean_html(body).text_content()
# The encode/decode round-trip drops any non-ASCII characters.
print(text.encode('ascii', 'ignore').decode('ascii'))

You want the content, so presumably you don't want any JavaScript or CSS. Also, presumably you want only the content in the body, not markup from the head. Read up on lxml.html.clean to see what you can easily strip out. Way smarter than regular expressions, no?

Also, watch out for unicode encoding problems. You can easily end up with HTML that you cannot print.
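
One way around that, sketched here with a made-up sample string, is to re-encode for whatever encoding stdout uses and replace anything it cannot represent:

import sys

def safe_print(text):
    # Replace unencodable characters instead of raising
    # UnicodeEncodeError on terminals with a narrow encoding.
    enc = sys.stdout.encoding or "utf-8"
    print(text.encode(enc, "replace").decode(enc))

safe_print("caf\u00e9 \u2013 r\u00e9sum\u00e9")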


2012-11-08: changed from using urllib2 to requests. Just use requests!


2 Comments

-1. OP's requirement is simple, remove all tags. There's no need for BeautifulSoup.
Here are a couple of things the OP might consider obvious but has omitted from the question: document section (head and body? body only?) and javascript (does the OP consider javascript part of the content?). Those are easily controllable with BeautifulSoup and lxml. Regular expressions will not deal with them at all.
import re
import urllib.request

# Strip anything between angle brackets plus two common entities;
# re.DOTALL lets '.' match newlines inside multi-line tags.
patjunk = re.compile(r"<.*?>|&nbsp;|&amp;", re.DOTALL | re.M)
url = "http://www.yahoo.com"

def test(url, pat):
    # read() returns bytes in Python 3, so decode before substituting.
    html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    return pat.sub("", html)

print(test(url, patjunk))

1 Comment

I believe this will handle all HTML entities: '&(([a-z]{1,5})|(#\d{1,4}));'
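
For what it's worth, the standard library can already decode entities without a hand-rolled pattern; a small sketch using html.unescape (available since Python 3.4):

import html

# Decodes named and numeric entities alike; &nbsp; becomes a
# non-breaking space character.
print(html.unescape("Fish &amp; chips&nbsp;&#169; 2010"))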
