525

I am trying to grab everything after the '</html>' tag and delete it, but my code doesn't seem to be doing anything. Does .replace() not support regex?

z.write(article.replace('</html>.+', '</html>'))
4
  • 83
    Warning: parsing HTML with regular expressions leads to madness. Commented Jul 13, 2012 at 18:08
  • 6
    I have a bunch of garbage after my closing html tag and I just want to remove it. Commented Jul 13, 2012 at 18:11
  • 1
    But what if your HTML has a quoted string, comment, JavaScript, or CDATA containing </html>? Or what if the garbage at the end itself has a </html>? Unless you can guarantee that none of those etc. can happen, you either need to fully parse the HTML or have some other way of knowing how much data you have (e.g. a Content-Length: HTTP header). Commented Jul 13, 2012 at 18:16
  • 16
    none of those things are a factor. Commented Jul 13, 2012 at 18:19

4 Answers

883

No. Regular expressions in Python are handled by the re module.

import re

article = re.sub(r'(?is)</html>.+', '</html>', article)

In general:

str_output = re.sub(regex_search_term, regex_replacement, str_input)
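As a minimal sketch of the difference, the snippet below shows that str.replace() treats its first argument as literal text, while re.sub() interprets it as a pattern. The helper name strip_after_closing_html and the sample article are made up for illustration, and it passes the flags as a keyword argument instead of the inline (?is) form.

```python
import re

def strip_after_closing_html(text):
    # Hypothetical helper: drop everything after the first </html>.
    # IGNORECASE also matches </HTML>; DOTALL lets '.' match newlines.
    return re.sub(r'</html>.*', '</html>', text,
                  flags=re.IGNORECASE | re.DOTALL)

article = "<html>body</HTML>\ngarbage\nmore garbage"

# str.replace looks for the literal text '</html>.+', finds nothing,
# and returns the string unchanged.
unchanged = article.replace('</html>.+', '</html>')
print(unchanged == article)

print(strip_after_closing_html(article))
```

Note that the replacement writes a lowercase '</html>' regardless of how the original tag was cased.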

5 Comments

How would I apply the re module to my 'article' variable?
I tried the following to no avail z.write(re.sub(r'</html>.+', r'</html>', article))
Is the tag not lowercase, or is it followed by a '\n'? You can make it case-insensitive ((?i) flag) and make . match newlines ((?s) flag) with r'(?is)</html>.+'.
Using flags would be more readable, i.e. adding flags=re.DOTALL | re.IGNORECASE as the last argument instead of the (?is) in the pattern.
thank you. Yes, it worked for me. I used it in my script.
118

To replace text using a regular expression, use the re.sub function:

sub(pattern, repl, string[, count, flags])

It replaces non-overlapping occurrences of pattern in string with repl. If you need to inspect each match, for instance to extract information from specific capture groups, you can pass a function as the repl argument. More information is in the re module documentation.

Examples

>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'

>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'
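Passing a function as repl could look like the sketch below. The function name double_id and the doubling rule are invented for illustration; the point is only that re.sub calls the function once per match and uses its return value as the replacement.

```python
import re

def double_id(match):
    # match.group(1) holds the digits captured by (\d+).
    return '/' + str(int(match.group(1)) * 2)

result = re.sub(r'/(\d+)', double_id, '/andre/23/abobora/43435')
print(result)  # '/andre/46/abobora/86870'
```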


9

You can use the re module for regexes, but regexes are probably overkill for what you want. I might try something like

z.write(article[:article.index("</html>") + 7])

This is much cleaner, and should be much faster than a regex based solution.

5 Comments

Not so clean; you have to hard-code the length of "</html>".
@DanielGriscom : what about len(str('</html>')) ?
@OleAnders Better, but then you're duplicating that string, which opens another possibility for error.
@OleAnders ... and just realized; no need for the str(); just use len('</html>')
I was pretty much assuming this was a throwaway script - both the regex approach and the string search approach have all sorts of inputs they'll fail on. For anything in production, I would want to be doing some sort of more sophisticated parsing than either regex or simple string search can accomplish.
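One way to address the hard-coded length and the duplicated string raised in the comments is to keep the marker in a single variable. This is only a sketch for throwaway use; the function name truncate_after is hypothetical, and it uses str.find() rather than index() so a missing tag returns the text unchanged instead of raising ValueError.

```python
def truncate_after(text, marker='</html>'):
    # Hypothetical helper: keep everything up to and including
    # the first occurrence of marker.
    end = text.find(marker)
    if end == -1:
        # Marker not present: leave the text as it is.
        return text
    return text[:end + len(marker)]

print(truncate_after('<html>Larala</html>Kurimon'))
```

Like the original slice, this is case-sensitive and stops at the first '</html>' it sees.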
9

For this particular case, if using the re module is overkill, how about using the split (or rsplit) method:

se='</html>'
z.write(article.split(se)[0]+se)

For example,

#!/usr/bin/python

article='''<html>Larala
Ponta Monta 
</html>Kurimon
Waff Moff
'''
z=open('out.txt','w')

se='</html>'
z.write(article.split(se)[0]+se)
z.close()

outputs out.txt as

<html>Larala
Ponta Monta 
</html>
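A related sketch uses str.partition(), which splits only at the first occurrence and returns the separator it actually found. The function name keep_through_marker is made up here. One difference from split(se)[0]+se is that when the tag is missing, nothing is appended.

```python
def keep_through_marker(text, marker='</html>'):
    # partition returns (head, separator, tail); the separator is ''
    # when marker does not occur, so no tag is invented in that case.
    head, sep, _tail = text.partition(marker)
    return head + sep

article = '''<html>Larala
Ponta Monta
</html>Kurimon
Waff Moff
'''
print(keep_through_marker(article))
```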
