25

I have some XML fragments like this:

<!DOCTYPE mensaje SYSTEM "record.dtd">
<record>
    <player_birthday>1979-09-23</player_birthday>
    <player_name>Orene Ai'i</player_name>
    <player_team>Blues</player_team>
    <player_id>453</player_id>
    <player_height>170</player_height>
    <player_position>F&W</player_position>   <---- a '&' here.
    <player_weight>75</player_weight>
</record>

Is there any way to check whether such an XML fragment is well-formed? Is there any way to validate the XML against a DTD or an XML Schema?

For various reasons I can't use any third-party packages.

For example, the XML above is not correct because it contains a bare '&'. Note that the DOCTYPE declaration refers to a DTD.

1
  • I consider it risky to violate XML at the token level (level 0) and then hope to find a tool that checks for level-1 compliance. Such a tool is no more likely to exist among first-party tools. If I count correctly in the backtrace, jsbueno's answer fails for that reason. Why is replacing it with "&amp;" not an option? Commented Dec 6, 2012 at 13:15
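A minimal sketch of the replacement suggested in the comment above, assuming you control the code that generates the element text. It uses the standard library's xml.sax.saxutils.escape; the variable names are hypothetical. Note that escaping must be applied to the text values only, not to an already assembled document, since that would also escape the markup itself.

```python
from xml.sax.saxutils import escape
from xml.etree import ElementTree as ET

# Hypothetical raw value containing a character that is special in XML.
raw_value = "F&W"

# Escape only the text content, then build the fragment around it.
fragment = "<player_position>%s</player_position>" % escape(raw_value)

# The escaped fragment now parses, and the parser restores the original text.
elem = ET.fromstring(fragment)
print(fragment)
print(elem.text)
```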

2 Answers

41

Just try to parse it with ElementTree (xml.etree.ElementTree.fromstring); it will raise a ParseError if the XML is not well-formed.

>>> a = """<record>
...     <player_birthday>1979-09-23</player_birthday>
...     <player_name>Orene Ai'i</player_name>
...     <player_team>Blues</player_team>
...     <player_id>453</player_id>
...     <player_height>170</player_height>
...     <player_position>F&W</player_position>   <---- a '&' here.
...     <player_weight>75</player_weight>
... </record>"""
>>> 
>>> from xml.etree import ElementTree as ET
>>> x = ET.fromstring(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1282, in XML
    parser.feed(text)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 7, column 24
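The same check can be wrapped in a small helper. This is a sketch only: the function name is_well_formed and the shortened sample strings are assumptions made for illustration. It tests well-formedness only, because ElementTree does not validate against a DTD or schema.

```python
from xml.etree import ElementTree as ET

def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # str(err) carries the parser's message, including line and column.
        return False, str(err)
    return True, None

# Shortened, hypothetical samples based on the record in the question.
bad = "<record><player_position>F&W</player_position></record>"
good = "<record><player_position>F&amp;W</player_position></record>"

print(is_well_formed(bad))
print(is_well_formed(good))
```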

2 Comments

How can I avoid the warning "FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead." when writing if ElementTree.fromstring(output_payload): ?
This is due to the way xml.etree makes the truthiness of Element objects depend on their contents, so elements that have no subelements evaluate to False. That is why the warning asks for a specific test (if elem is not None rather than if elem). They decided to change this behavior. You can suppress warnings using: with open(os.devnull, "w") as devnull: and then with contextlib.redirect_stderr(devnull): .
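As a sketch of the explicit test mentioned in the comment above: checking against None avoids relying on element truthiness, so the warning should not be triggered at all. The helper name parse_or_none is hypothetical.

```python
from xml.etree import ElementTree as ET

def parse_or_none(xml_text):
    """Return the root Element, or None if xml_text is not well-formed."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        return None

# An element with no children: well-formed, but len(elem) is 0.
elem = parse_or_none("<record/>")

# Compare against None explicitly instead of writing "if elem:".
if elem is not None:
    print("parsed; child count:", len(elem))
else:
    print("not well-formed")
```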
9

You can use Python's xml.dom.minidom XML parser (which is in the standard library, but isn't as powerful as alternatives such as lxml).

Just do:

import xml.dom.minidom
xml.dom.minidom.parseString('<My><XML><String/><XML/><My/>')

You will get an xml.parsers.expat.ExpatError if the XML is not well-formed.
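A minimal sketch of catching that exception, assuming a hypothetical helper named check_with_minidom. The first sample is a properly closed variant of the string above; the second is the string above as written, which does not close its elements and therefore fails to parse.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def check_with_minidom(xml_text):
    """Return (ok, line, column); line and column are None on success."""
    try:
        xml.dom.minidom.parseString(xml_text)
    except ExpatError as err:
        # ExpatError exposes the error position as lineno and offset.
        return False, err.lineno, err.offset
    return True, None, None

print(check_with_minidom("<My><XML><String/></XML></My>"))
print(check_with_minidom("<My><XML><String/><XML/><My/>"))
```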

3 Comments

Minidom is no longer the preferred way of parsing MXL in standard Python (although it won't matter in this specific case, unless performance matters)
You may want to correct the XML spelling; by the way: what is the preferred way now?
@guidot jsbueno suggested the use of ElementTree in his own answer which is actually more powerful than minidom and should indeed be used! If you have access to non-standard libraries, lxml probably is the best out there!
