
I want to remove dangling attributes (attributes without a value) from HTML elements.

I use the regex re.sub(r'(<[\S]+.*\s)[^=]+[\s]', r'\1', x) to find attributes without an =.

>>> import re
>>> string_list = ['<tag valid1="o n e" valid2=two some dangling></tag>', '<tag valid1="o n e" valid2=two some dangling/>']
>>> map(lambda x: re.sub(r'(<[\S]+.*\s)[^=]+[\s]', r'\1', x), string_list)
['<tag valid1="o n e" valid2=two dangling></tag>', '<tag valid1="o n e" valid2=two dangling/>']

But this only removes the first dangling attribute. How can I remove all of them?

  • Trying to parse HTML with regexes is extremely fragile. Using an actual HTML parser is much easier and safer. Commented Jul 3, 2018 at 2:57
  • @user2357112 I would like to use ElementTree to parse it, but it only supports XML, which does not allow dangling attributes. That is why I want to do this. Commented Jul 3, 2018 at 3:04
  • Python comes with an HTML parser, and projects like BeautifulSoup make data extraction even easier. Commented Jul 3, 2018 at 3:11
  • @user2357112 This is not available in 2.7 :( Commented Jul 3, 2018 at 3:12
  • It's just under a different name on 2.7. Commented Jul 3, 2018 at 3:14

2 Answers

1

I chose to use HTMLParser to parse the HTML directly, instead of preprocessing the HTML so that ElementTree could parse it as XML.
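The answer does not include code, so the following is only a minimal sketch of how such an approach might look, not the author's actual implementation. It assumes Python 3, where the parser lives in html.parser (on 2.7 the module is named HTMLParser). The names DanglingAttrStripper and strip_dangling are hypothetical. The sketch relies on the parser reporting an attribute without a value as a (name, None) pair. Note that it re-quotes every kept value, lowercases tag and attribute names, and silently drops comments and doctype declarations.

```python
from html import escape
from html.parser import HTMLParser  # on Python 2.7: from HTMLParser import HTMLParser


class DanglingAttrStripper(HTMLParser):
    """Hypothetical helper: rebuild the markup, dropping attributes that have no value."""

    def __init__(self):
        # convert_charrefs=False keeps entity and character references verbatim.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _open_tag(self, tag, attrs, closing):
        kept = ''.join(
            ' {}="{}"'.format(name, escape(value, quote=True))
            for name, value in attrs
            if value is not None  # value is None for a bare attribute such as `dangling`
        )
        self.parts.append('<{}{}{}'.format(tag, kept, closing))

    def handle_starttag(self, tag, attrs):
        self._open_tag(tag, attrs, '>')

    def handle_startendtag(self, tag, attrs):
        self._open_tag(tag, attrs, '/>')

    def handle_endtag(self, tag):
        self.parts.append('</{}>'.format(tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&{};'.format(name))

    def handle_charref(self, name):
        self.parts.append('&#{};'.format(name))


def strip_dangling(markup):
    """Return markup with value-less attributes removed (sketch only)."""
    parser = DanglingAttrStripper()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.parts)


print(strip_dangling('<tag valid1="o n e" valid2=two some dangling></tag>'))
# prints: <tag valid1="o n e" valid2="two"></tag>
```

Because every kept value comes back quoted and escaped, the result is closer to something an XML parser such as ElementTree will accept, although well-formedness of the rest of the document is not guaranteed.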


1 Comment

That's nice, but how did you implement your solution? It would be good to know for future readers.
0

Use re.findall to tokenize the parts.

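import re

string_list = ['<div>\n<tag valid1="o n e" valid2=two some dangling></tag>\n<tag valid1="o n e" valid2=two some dangling/>\n</div>', '<tag valid1="o n e"\n valid2=two some dangling></tag>']
for string in string_list:
    output = ''
    # Split the markup into (tag opening, raw attribute text, tag close plus trailing text).
    for pre, attrs, post in re.findall(r'([^<]*</?\w+)\b(.*?)(/?>[^<]*)', string, re.DOTALL):
        # Group 1 captures name=value attributes; bare words match the second
        # alternative and leave group 1 empty, so they are dropped by the join.
        output += pre + ''.join([attr[0] for attr in re.findall(r'(\s+\w+=(?:([\'"]).*?\2|\S+))|\S+', attrs)]) + post
    print(output)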

This outputs:

<div>
<tag valid1="o n e" valid2=two></tag>
<tag valid1="o n e" valid2=two/>
</div>
<tag valid1="o n e"
 valid2=two></tag>

9 Comments

></tag> and /> will become >
Updated again then.
What about <div>\n<tag valid1="o n e" valid2=two some dangling></tag>\n<tag valid1="o n e" valid2=two some dangling/>\n</div>?
valid2 is missing. Anyway, thanks a lot for your patience. I chose to use HTMLParser to reconstruct the HTML for ElementTree.
With string = '<tag valid1="o n e"\n valid2=two some dangling></tag>', the output 'tag valid1="o n e"\n valid2=two some dangling></tag>' is incorrect (the leading < is missing). Never mind. Maybe it really is not a good idea to parse HTML manually at the text level. :)