How to repeatedly apply a regex replacement in Python?

Question

I want to remove dangling attributes (attributes that have no value) from HTML elements.

I use re.sub(r'(<[\S]+.*\s)[^=]+[\s]', r'\1', x) to find and remove attributes without an =.

>>> import re
>>> string_list = ['<tag valid1="o n e" valid2=two some dangling></tag>', '<tag valid1="o n e" valid2=two some dangling/>']
>>> [re.sub(r'(<[\S]+.*\s)[^=]+[\s]', r'\1', x) for x in string_list]
['<tag valid1="o n e" valid2=two dangling></tag>', '<tag valid1="o n e" valid2=two dangling/>']

But this only removes the first dangling attribute. How can I apply the replacement repeatedly so that all of them are removed?

Trying to parse HTML with regexes is extremely fragile. Using an actual HTML parser is much easier and safer.
– user2357112, Commented Jul 3, 2018 at 2:57
@user2357112 I would like to use ElementTree to parse it, but it only supports XML, which does not allow dangling attributes. That is why I want to do this.
– MoYummy, Commented Jul 3, 2018 at 3:04
Python comes with an HTML parser, and projects like BeautifulSoup make data extraction even easier.
– user2357112, Commented Jul 3, 2018 at 3:11

MoYummy · Accepted Answer · 2018-07-03 09:51:58Z

1

I chose to use HTMLParser to parse the HTML directly, instead of preprocessing the HTML and then using ElementTree to parse it as XML.
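A minimal sketch of how this might look with the standard library's html.parser.HTMLParser (Python 3). The names DanglingAttributeStripper and strip_dangling_attributes are hypothetical, chosen for illustration rather than taken from this answer. The sketch relies on the parser reporting a valueless attribute as (name, None), and it re-serializes the remaining attributes with double quotes so the result is easier to hand to an XML parser. Comments, doctypes, and processing instructions are simply dropped here.

```python
from html import escape
from html.parser import HTMLParser


class DanglingAttributeStripper(HTMLParser):
    """Re-emit HTML, dropping attributes that have no value (no '=')."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    @staticmethod
    def _format_attrs(attrs):
        # HTMLParser reports a valueless attribute as (name, None).
        return ''.join(
            ' {}="{}"'.format(name, escape(value, quote=True))
            for name, value in attrs
            if value is not None
        )

    def handle_starttag(self, tag, attrs):
        self.parts.append('<{}{}>'.format(tag, self._format_attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        self.parts.append('<{}{}/>'.format(tag, self._format_attrs(attrs)))

    def handle_endtag(self, tag):
        self.parts.append('</{}>'.format(tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&{};'.format(name))

    def handle_charref(self, name):
        self.parts.append('&#{};'.format(name))


def strip_dangling_attributes(html):
    """Return html with valueless attributes removed (hypothetical helper)."""
    parser = DanglingAttributeStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


print(strip_dangling_attributes('<tag valid1="o n e" valid2=two some dangling></tag>'))
```

With the first sample string from the question this should give <tag valid1="o n e" valid2="two"></tag>. Note that the unquoted value two comes back quoted, and that HTMLParser lowercases tag and attribute names.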

answered Jul 3, 2018 at 9:51

MoYummy

1 Comment

Toto Over a year ago

That's nice, but how did you implement your solution? It would be good for future readers to know.

blhsing · Accepted Answer · 2018-07-03 13:15:40Z

0

Use re.findall to tokenize each tag into its parts, then keep only the attributes that have a value.

import re

string_list = [
    '<div>\n<tag valid1="o n e" valid2=two some dangling></tag>\n<tag valid1="o n e" valid2=two some dangling/>\n</div>',
    '<tag valid1="o n e"\n valid2=two some dangling></tag>',
]
for string in string_list:
    output = ''
    # Split the text into (leading text + tag name, attribute text, tag close + trailing text).
    for pre, attrs, post in re.findall(r'([^<]*</?\w+)\b(.*?)(/?>[^<]*)', string, re.DOTALL):
        # Group 1 is filled only for name=value attributes; dangling words leave it empty and are dropped.
        output += pre + ''.join(attr[0] for attr in re.findall(r'(\s+\w+=(?:([\'"]).*?\2|\S+))|\S+', attrs)) + post
    print(output)

This outputs:

<div>
<tag valid1="o n e" valid2=two></tag>
<tag valid1="o n e" valid2=two/>
</div>
<tag valid1="o n e"
 valid2=two></tag>
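For the literal question of applying a substitution repeatedly, one option is to loop until a pass makes no replacement, which re.subn reports through its count. The sketch below is only an illustration under assumptions: sub_until_stable, max_passes, and TRAILING_BARE_WORD are hypothetical names, and the pattern is a swapped-in one (not the regex from the question) that removes a bare word sitting directly before > or />, so each pass peels off the last dangling attribute of a tag. It is as fragile as any regex-based HTML handling, for example when a quoted value contains a >.

```python
import re

# Assumed pattern: whitespace plus a bare word (no '=', quotes, slashes or
# angle brackets) that sits directly before the closing '>' or '/>' of a tag.
TRAILING_BARE_WORD = re.compile(r'''\s+[^\s=<>/"']+(?=\s*/?>)''')


def sub_until_stable(pattern, repl, text, max_passes=100):
    """Apply pattern.subn repeatedly until a pass replaces nothing."""
    for _ in range(max_passes):
        text, count = pattern.subn(repl, text)
        if count == 0:
            break
    return text


string_list = [
    '<tag valid1="o n e" valid2=two some dangling></tag>',
    '<tag valid1="o n e" valid2=two some dangling/>',
]
for s in string_list:
    print(sub_until_stable(TRAILING_BARE_WORD, '', s))
```

On the two strings from the question this should leave <tag valid1="o n e" valid2=two></tag> and <tag valid1="o n e" valid2=two/>. The max_passes cap is a safety net in case a pattern and replacement never reach a stable result.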

edited Jul 3, 2018 at 13:15

answered Jul 3, 2018 at 3:06

blhsing

9 Comments

MoYummy Over a year ago

></tag> and /> will become >

blhsing Over a year ago

Updated again then.

MoYummy Over a year ago

What about <div>\n<tag valid1="o n e" valid2=two some dangling></tag>\n<tag valid1="o n e" valid2=two some dangling/>\n</div>?

MoYummy Over a year ago

valid2 is missing. Anyway, thanks a lot for your patience. I chose to use HTMLParser to reconstruct the HTML for ElementTree.

MoYummy Over a year ago

With string = '<tag valid1="o n e"\n valid2=two some dangling></tag>', the output 'tag valid1="o n e"\n valid2=two some dangling></tag>' is incorrect (the leading < is missing). Never mind. Maybe it really is not a good idea to parse HTML manually at the text level. :)
