Python regex sub multiple times

Question

I'm trying to match several words inside HTML tags. For instance, like this:

<title>GateUser UserGate</title>

I want to catch both 'GateUser' and 'UserGate'. I'm using the following regexp:

re.sub(ur'(<.*>.*)(\b\w{8}\b)(.*</.*>)', r'\1\g<2>ADDED\3', html)

I would like to replace every word inside an HTML tag that matches the \b\w{8}\b condition, but this re.sub call changes only one of them.
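A minimal sketch of why only one word changes, assuming Python 3 (so the `ur` prefix is dropped) and the sample `<title>` line above. The greedy groups make the pattern consume the whole element as a single match, so group 2 can hold only one word. The variable names are illustrative.

```python
import re

html = "<title>GateUser UserGate</title>"

# The greedy first group swallows everything up to the LAST 8-character
# word, so the whole element is one match and only that word is changed.
whole_element = re.sub(r'(<.*>.*)(\b\w{8}\b)(.*</.*>)', r'\1\g<2>ADDED\3', html)
print(whole_element)  # <title>GateUser UserGateADDED</title>

# Matching just the word lets re.sub visit every occurrence.
# \g<0> in the replacement refers to the entire match.
every_word = re.sub(r'\b\w{8}\b', r'\g<0>ADDED', "GateUser UserGate")
print(every_word)  # GateUserADDED UserGateADDED
```

So re.sub itself is not limited to one replacement. The limit comes from wrapping the word in groups that span the whole tag.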

Not clear, can you please elaborate a bit? Do you want to replace both words, or anything inside the tag?
– Mustofa Rizwan, Commented Nov 16, 2016 at 14:40
Regex and HTML don't go well together (obligatory link). Why don't you use an HTML parser to get the text content of the tag, then modify only that?
– mata, Commented Nov 16, 2016 at 15:10

LycuiD · Accepted Answer · 2016-11-16 19:07:28Z

1

Using re to parse HTML is not really necessary, since there are many brilliantly written libraries for that. Still, one way to achieve what you want is to:

parse the tags;
change their innerHtml.

Let's say you have some HTML:

a = """
  <title>GateUser UserGate</title>
  <div style="something">
    KameHame Ha
  </div>
  """

Now you can relatively easily parse the tags including the innerHtml:

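import re

# Optional whitespace between a tag and its text (\s already covers \n and \t).
# It must be allowed to match nothing, otherwise <title>...</title> is skipped.
blanks = r"(\s*)"
pat = re.compile(r"(<.+>){0}(.*?){0}(</.+>)".format(blanks))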

# findall returns tuples, which don't support item assignment, so convert each one to a list.
tags_with_inner = list(map(list, pat.findall(a)))

# [ ['<title>', '', 'GateUser UserGate', '', '</title>'],
# ['<div style="something">', '\n    ', 'KameHame Ha', '\n  ', '</div>']]

And then match your regex on the inner only:

only_inner = re.compile(r"\b\w{8}\b")  # your expression

for inner in tags_with_inner:
    inner[2] = only_inner.sub("ADDED", inner[2])
    print("".join(inner))

# <title>ADDED ADDED</title>
# <div style="something">
#     ADDED Ha
#   </div>
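A variation that returns the whole document instead of printing each element: a rough sketch that passes a function to re.sub, so the word pattern only ever sees the text between a `>` and the next `<`. The names `WORD`, `TEXT` and `replace_in_text` are hypothetical. It assumes no `>` or `<` appears inside attribute values, scripts or comments, which real HTML does not guarantee.

```python
import re

WORD = re.compile(r"\b\w{8}\b")   # the pattern from the question
TEXT = re.compile(r">([^<>]+)<")  # assumption: text sits between '>' and '<'

def replace_in_text(html, replacement="ADDED"):
    """Hypothetical helper: rewrite words only in the text between tags."""
    def fix(match):
        # Put the delimiters back and substitute inside the text only.
        return ">" + WORD.sub(replacement, match.group(1)) + "<"
    return TEXT.sub(fix, html)

a = """
  <title>GateUser UserGate</title>
  <div style="something">
    KameHame Ha
  </div>
  """

print(replace_in_text(a))
```

Because tag names and attributes are never part of a match, an 8-character attribute value is left alone, and everything outside the text nodes comes back unchanged.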

edited Nov 16, 2016 at 19:07

answered Nov 16, 2016 at 18:51

2 Comments

user4251615 Over a year ago

OK. How can I get the original HTML back, but with the replacements applied? That is the main thing that worries me.

LycuiD Over a year ago

Well, it's better to use an HTML/XML parser module for that; you are just making things difficult for yourself. Try the lxml module on PyPI, it's pretty decent.
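For reference, a parser-based sketch that returns the original HTML with only the text changed. It uses the standard library's html.parser rather than lxml, purely so that it runs without extra installs. `WordReplacer` and `replace_words` are hypothetical names. End tags are re-emitted in lowercase, and processing instructions are dropped, so treat it as a starting point rather than a faithful round trip.

```python
import re
from html.parser import HTMLParser

WORD = re.compile(r"\b\w{8}\b")  # the pattern from the question

class WordReplacer(HTMLParser):
    """Hypothetical helper: re-emit the markup, editing only text nodes."""

    def __init__(self, replacement="ADDED"):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.replacement = replacement
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it appeared
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # The only place where the word pattern is applied
        self.parts.append(WORD.sub(self.replacement, data))

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)

def replace_words(html):
    parser = WordReplacer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

a = """
  <title>GateUser UserGate</title>
  <div style="something">
    KameHame Ha
  </div>
  """

print(replace_words(a))
```

The same idea carries over to lxml or another parser: walk the text nodes, run the substitution on each, and serialize the tree back to a string.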
