I'm trying to match several words inside HTML tags. For instance, like this:

<title>GateUser UserGate</title>

I want to catch both 'GateUser' and 'UserGate'. I'm using the following regexp:

re.sub(ur'(<.*>.*)(\b\w{8}\b)(.*</.*>)', r'\1\g<2>ADDED\3', html)

I would like to replace every word inside an HTML tag that matches the \b\w{8}\b condition, but with this pattern re.sub replaces only one.

4 Comments

  • Hand re.sub a fourth parameter: re.GLOBAL. Commented Nov 16, 2016 at 14:29
  • Not clear, can you please elaborate a bit? Do you want to replace both words, or anything inside the tag? Commented Nov 16, 2016 at 14:40
  • Regex and HTML don't go well together (obligatory link). Why don't you use an HTML parser to get the text content of the tag, then modify only that? Commented Nov 16, 2016 at 15:10
  • Thou shall not use regex to parse HTML. Commented Nov 16, 2016 at 15:21

1 Answer

Using re to parse HTML is not really needed, since there are many well-written libraries for that. Still, one way to achieve what you want is:

  • parse the tags, then
  • change their inner HTML.

Let's say you have some HTML:

a = """
  <title>GateUser UserGate</title>
  <div style="something">
    KameHame Ha
  </div>
  """

Now you can relatively easily parse the tags, including the inner HTML:

import re

# Optional whitespace around the inner text. It has to be allowed to match
# nothing, otherwise <title>GateUser UserGate</title> would be skipped.
blanks = r"(\s*)"
pat = re.compile(r"(<.+>){0}(.*?){0}(</.+>)".format(blanks))

# findall returns tuples, which don't support item assignment, so map each to a list.
tags_with_inner = list(map(list, pat.findall(a)))

# [['<title>', '', 'GateUser UserGate', '', '</title>'],
#  ['<div style="something">', '\n    ', 'KameHame Ha', '\n  ', '</div>']]

Then apply your regex to the inner text only:

only_inner = re.compile(r"\b\w{8}\b")  # your expression

for inner in tags_with_inner:
    # index 2 is the inner text; the tags and whitespace stay untouched
    inner[2] = only_inner.sub("ADDED", inner[2])
    print("".join(inner))

# <title>ADDED ADDED</title>
# <div style="something">
#     ADDED Ha
#   </div>
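
If you need the whole document back rather than the separate pieces, one option is to hand re.sub a function instead of a replacement string. The following is only a sketch under the same assumptions as above (simple, non-nested tags whose inner text sits on one line); the names html and replace_inner are made up for this example.

```python
import re

html = """
  <title>GateUser UserGate</title>
  <div style="something">
    KameHame Ha
  </div>
  """

# Same shape as the pattern above: open tag, blanks, inner text, blanks, close tag.
pat = re.compile(r"(<.+>)(\s*)(.*?)(\s*)(</.+>)")
only_inner = re.compile(r"\b\w{8}\b")  # the expression from the question


def replace_inner(match):
    # Rebuild the matched element, changing only the inner text (group 3).
    open_tag, lead, inner, trail, close_tag = match.groups()
    return open_tag + lead + only_inner.sub("ADDED", inner) + trail + close_tag


# Anything outside a match (here, the surrounding whitespace) is copied through as-is.
result = pat.sub(replace_inner, html)
print(result)
```

Because re.sub calls the function once per match and leaves everything else alone, the result is the original string with only the 8-letter words swapped out. It inherits every weakness of the pattern, so nested or multi-line content is still better handled by a parser.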

2 Comments

OK. How can I get the original HTML, but with the replacements applied? That is the main thing that worries me.
Well, it's better to use HTML/XML parser modules for that; you are just making things difficult for yourself. Try the lxml module on PyPI, it's pretty decent.
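
To illustrate the parser route without installing anything, here is a rough sketch that uses the standard library's html.parser in place of lxml (so this is not the lxml API). It re-emits the markup as it is parsed and rewrites only the text nodes. TextRewriter and rewrite are hypothetical names for this example. Known limits of the sketch: end tags are re-emitted in lowercase, text inside <script> or <style> is rewritten too, and less common constructs such as processing instructions are dropped.

```python
import re
from html.parser import HTMLParser

WORD = re.compile(r"\b\w{8}\b")  # the expression from the question


class TextRewriter(HTMLParser):
    """Hypothetical helper: re-emits the markup, rewriting only text nodes."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; as separate events.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag text, attributes untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # self-closing tags such as <br/>

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(WORD.sub("ADDED", data))  # the only place text is changed

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))

    def handle_comment(self, data):
        self.out.append("<!--{}-->".format(data))

    def handle_decl(self, decl):
        self.out.append("<!{}>".format(decl))


def rewrite(html):
    parser = TextRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<title>GateUser UserGate</title>\n<div style="something">\n  KameHame Ha\n</div>'
print(rewrite(sample))
```

Since attribute values never pass through handle_data, an 8-letter word inside an attribute is left alone, which is hard to guarantee with a single regex over the whole document.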