0

I'm trying to use lxml to help me parse some XML files and output it. However,there are some special characters in the XML file. I don't want to replace it because it is too complicated to escape it and unescape it. Also I can't force the others to produce a well-formed XML.

Is there any way Python can let me handle the non-well-formed XML with lxml?

I can read it in properly:

  parser = etree.XMLParser(recover=True)
  root = etree.parse(sys.argv[1],parser=parser)

But when I want to print the element text, it can only print the content until the special character occurs.

  for element in root.iter("content"):
    print("%s - %s  attr - %s" % (element.tag, element.text, element.get("name"))) 

2 Answers 2

1

lxml does unescaping for you transparently. So you could try to fix invalid characters in the input first and then feed the result to lxml. For example, you could try a simple regex-based solution to escape invalid characters.

Sign up to request clarification or add additional context in comments.

Comments

0

A popular option to deal with less-than-perfect markup language files in Python is to use Beautiful Soup. It can use many parsers, including lxml.

Can you post some of the XML that's giving you problems?

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.