python .replace() regex [duplicate]

Question

I am trying to do a grab everything after the '</html>' tag and delete it, but my code doesn't seem to be doing anything. Does .replace() not support regex?

z.write(article.replace('</html>.+', '</html>'))

Warning: parsing HTML with regular expressions leads to madness. — Adam Rosenfield
– Adam Rosenfield, Commented Jul 13, 2012 at 18:08
I have a bunch of garbage after my closing html tag and I just want to remove it. — user1442957
– user1442957, Commented Jul 13, 2012 at 18:11
But what if your HTML has a quoted string, comment, JavaScript, or CDATA containing </html>? Or what if the garbage at the end itself has a </html>? Unless you can guarantee that none of those etc. can happen, you either need to fully parse the HTML or have some other way of knowing how much data you have (e.g. a Content-Length: HTTP header). — Adam Rosenfield
– Adam Rosenfield, Commented Jul 13, 2012 at 18:16

Flame · Accepted Answer · 2023-01-12 14:07:41Z

883

No. Regular expressions in Python are handled by the re module.

article = re.sub(r'(?is)</html>.+', '</html>', article)

In general:

str_output = re.sub(regex_search_term, regex_replacement, str_input)

edited Jan 12, 2023 at 14:07

Flame

7,7563 gold badges43 silver badges64 bronze badges

answered Jul 13, 2012 at 18:05

Ignacio Vazquez-Abrams

804k160 gold badges1.4k silver badges1.4k bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

user1442957 Over a year ago

How would I apply the re model to my 'article' variable?

user1442957 Over a year ago

I tried the following to no avail z.write(re.sub(r'</html>.+', r'</html>', article))

MRAB Over a year ago

Is the tag not lowercase, or is it followed by a '\n'? You can make it case-insensitive ((?i) flag) and make . match newlines ((?s) flag) with r'(?is)</html>.+'.

parvus Over a year ago

Using flags would be more readable, i.e. adding flags=re.DOTALL | re.IGNORECASE as last argument iso the (?is) in the pattern.

Oksana Sgibneva Sep 25 at 14:30

thank you. Yes, it worked for me. I used it in my script.

Andre Pena · Accepted Answer · 2020-02-06 04:40:36Z

118

In order to replace text using regular expression use the re.sub function:

sub(pattern, repl, string[, count, flags])

It will replace non-everlaping instances of pattern by the text passed as string. If you need to analyze the match to extract information about specific group captures, for instance, you can pass a function to the string argument. more info here.

Examples

>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'

>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'

edited Feb 6, 2020 at 4:40

answered Jan 3, 2017 at 16:02

Andre Pena

59.7k53 gold badges213 silver badges257 bronze badges

Comments

Julian · Accepted Answer · 2012-07-13 19:01:50Z

9

You can use the re module for regexes, but regexes are probably overkill for what you want. I might try something like

z.write(article[:article.index("</html>") + 7]

This is much cleaner, and should be much faster than a regex based solution.

answered Jul 13, 2012 at 19:01

Julian

2,70223 silver badges21 bronze badges

5 Comments

Daniel Griscom Over a year ago

Not so clean; you have to hard-code the length of "</html>".

Edgard Knive Over a year ago

@DanielGriscom : what about len(str('</html>')) ?

Daniel Griscom Over a year ago

@OleAnders Better, but then you're duplicating that string, which opens another possibility for error.

Daniel Griscom Over a year ago

@OleAnders ... and just realized; no need for the str(); just use len('</html>')

Julian Over a year ago

I was pretty much assuming this was a throwaway script - both the regex approach and the string search approach have all sorts of inputs they'll fail on. For anything in production, I would want to be doing some sort of more sophisticated parsing than either regex or simple string search can accomplish.

norio · Accepted Answer · 2017-06-24 20:08:09Z

9

For this particular case, if using re module is overkill, how about using split (or rsplit) method as

se='</html>'
z.write(article.split(se)[0]+se)

For example,

#!/usr/bin/python

article='''<html>Larala
Ponta Monta 
</html>Kurimon
Waff Moff
'''
z=open('out.txt','w')

se='</html>'
z.write(article.split(se)[0]+se)

outputs out.txt as

<html>Larala
Ponta Monta 
</html>

answered Jun 24, 2017 at 20:08

norio

3,9423 gold badges29 silver badges36 bronze badges

Collectives™ on Stack Overflow

python .replace() regex [duplicate]

4 Answers 4

5 Comments

Comments

5 Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

5 Comments

Comments

5 Comments

Comments

Linked

Related