How to handle escaped string in lxml with Python

Question

I'm trying to use lxml to help me parse some XML files and output it. However,there are some special characters in the XML file. I don't want to replace it because it is too complicated to escape it and unescape it. Also I can't force the others to produce a well-formed XML.

Is there any way Python can let me handle the non-well-formed XML with lxml?

I can read it in properly:

  parser = etree.XMLParser(recover=True)
  root = etree.parse(sys.argv[1],parser=parser)

But when I want to print the element text, it can only print the content until the special character occurs.

  for element in root.iter("content"):
    print("%s - %s  attr - %s" % (element.tag, element.text, element.get("name")))

Community · Accepted Answer · 2017-05-23 12:04:35Z

1

lxml does unescaping for you transparently. So you could try to fix invalid characters in the input first and then feed the result to lxml. For example, you could try a simple regex-based solution to escape invalid characters.

edited May 23, 2017 at 12:04

CommunityBot

11 silver badge

answered Nov 16, 2012 at 3:46

jfs

417k210 gold badges1k silver badges1.7k bronze badges

Sign up to request clarification or add additional context in comments.

Comments

austin · Accepted Answer · 2012-11-16 03:33:17Z

0

A popular option to deal with less-than-perfect markup language files in Python is to use Beautiful Soup. It can use many parsers, including lxml.

Can you post some of the XML that's giving you problems?

answered Nov 16, 2012 at 3:33

austin

5,8862 gold badges35 silver badges40 bronze badges

Collectives™ on Stack Overflow

How to handle escaped string in lxml with Python

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related