
I'm trying to look at an HTML file and remove all the tags from it so that only the text is left, but I'm having a problem with my regex. This is what I have so far:

import urllib.request, re

def test(url):
    html = str(urllib.request.urlopen(url).read())
    print(re.findall('<[\w\/\.\w]*>', html))

The HTML is a simple page with a few links and text, but my regex won't pick up the <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> declaration or <a href="...."> tags. Can anyone explain what I need to change in my regex?
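
For reference, a minimal sketch of why the pattern misses those tags: the class [\w\/\.\w] only matches word characters, slashes, and dots, so any tag containing a space, quote, hyphen, or exclamation point can never match. A catch-all class like <[^>]*> picks them up, though it is still fragile if a ">" ever appears inside an attribute value (the sample HTML below is made up):

import re

html = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
        '<a href="http://example.com">link</a> text')

# The original class lacks space, quote, '!', and '-', so the DOCTYPE
# and any tag carrying attributes never match:
print(re.findall(r'<[\w\/\.\w]*>', html))   # ['</a>']

# Matching any run of non-'>' characters between angle brackets
# catches them all:
print(re.findall(r'<[^>]*>', html))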

  • Problems parsing HTML with regex, you say? Why, I can scarcely believe it! Who would have thought! What a turn up for the books! PS. BeautifulSoup. Commented Jan 29, 2010 at 23:29
  • Stay calm, bobince. Breathe slowly into the paper bag. In, out, in out, ... stackoverflow.com/questions/1732348/… Commented Jan 30, 2010 at 0:02
  • I love the regularity with which these questions appear. It's like the "Find Similar Questions" part of the new question form doesn't work :D Commented Jan 30, 2010 at 0:31
  • If you spend a little time on SO you'll find that there are about infinity billion better ways to parse HTML and regex is not one of them. Commented Jan 30, 2010 at 0:34
  • Yes, if you are dealing with the vanishingly-small subset of HTML documents in the universe whose formatting you have perfect knowledge of. Commented Jan 30, 2010 at 10:35

2 Answers


Use BeautifulSoup. Use lxml. Do not use regular expressions to parse HTML.
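
For comparison, here is a minimal BeautifulSoup sketch (assuming the bs4 and requests packages are installed; the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

url = "http://example.com"  # placeholder
html = requests.get(url).text

soup = BeautifulSoup(html, "html.parser")
# Drop script and style elements so their contents don't leak into
# the extracted text.
for tag in soup(["script", "style"]):
    tag.decompose()

print(soup.get_text(separator=" ", strip=True))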


Edit 2010-01-29: This would be a reasonable starting point for lxml:

from lxml.html import fromstring
from lxml.html.clean import Cleaner
import requests

url = "https://stackoverflow.com/questions/2165943/removing-html-tags-from-a-text-using-regular-expression-in-python"
html = requests.get(url).text

doc = fromstring(html)

# Tags removed entirely; their text content is pulled up into the parent.
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6',
        'div', 'span',
        'img', 'area', 'map']
# The True flags strip that class of content (scripts, styles, <link> tags);
# the False flags leave meta tags, unsafe attributes, and page structure alone.
args = {'meta': False, 'safe_attrs_only': False, 'page_structure': False,
        'scripts': True, 'style': True, 'links': True, 'remove_tags': tags}
cleaner = Cleaner(**args)

path = '/html/body'
body = doc.xpath(path)[0]

text = cleaner.clean_html(body).text_content()
# The encode/decode round-trip drops any non-ASCII characters.
print(text.encode('ascii', 'ignore').decode('ascii'))

You want the content, so presumably you don't want any JavaScript or CSS. Also, presumably you want only the content in the body, not markup from the head. Read up on lxml.html.clean to see what you can easily strip out. Way smarter than regular expressions, no?

Also, watch out for unicode encoding problems. You can easily end up with HTML that you cannot print.
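
One way around that, sketched here with a made-up sample string, is to re-encode for whatever encoding stdout uses and replace anything it cannot represent:

import sys

def safe_print(text):
    # Replace unencodable characters instead of raising
    # UnicodeEncodeError on terminals with a narrow encoding.
    enc = sys.stdout.encoding or "utf-8"
    print(text.encode(enc, "replace").decode(enc))

safe_print("caf\u00e9 \u2013 r\u00e9sum\u00e9")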


2012-11-08: changed from using urllib2 to requests. Just use requests!


2 Comments

-1. OP's requirement is simple, remove all tags. There's no need for BeautifulSoup.
Here are a couple of things the OP might consider obvious but has omitted from the question: document section (head and body? body only?) and javascript (does the OP consider javascript part of the content?). Those are easily controllable with BeautifulSoup and lxml. Regular expressions will not deal with them at all.
import re
import urllib.request

# Strip anything between angle brackets plus two common entities;
# re.DOTALL lets '.' match newlines inside multi-line tags.
patjunk = re.compile(r"<.*?>|&nbsp;|&amp;", re.DOTALL | re.M)
url = "http://www.yahoo.com"

def test(url, pat):
    # read() returns bytes in Python 3, so decode before substituting.
    html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    return pat.sub("", html)

print(test(url, patjunk))

1 Comment

I believe this will handle all HTML entities: '&(([a-z]{1,5})|(#\d{1,4}));'
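
For what it's worth, the standard library can already decode entities without a hand-rolled pattern; a small sketch using html.unescape (available since Python 3.4):

import html

# Decodes named and numeric entities alike; &nbsp; becomes a
# non-breaking space character.
print(html.unescape("Fish &amp; chips&nbsp;&#169; 2010"))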
