12

I'm using the lxml.html library to parse an HTML document.

I have located a specific tag, which I call content_tag, and I want to change its content (i.e. the text between <div> and </div>). The new content is a string with some HTML in it, say 'Hello <b>world!</b>'.

How do I do that? I tried content_tag.text = 'Hello <b>world!</b>', but that escapes all the HTML tags, replacing < with &lt; and so on.

I want to inject the text without escaping any HTML. How can I do that?
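
For reference, the escaping behaviour described above can be reproduced with a minimal sketch. The '<div>Goodbye.</div>' input is only an assumed stand-in for the real document and the located element.

```python
import lxml.html

# Assumed stand-in for the real document and the located element.
content_tag = lxml.html.fromstring('<div>Goodbye.</div>')

# Assigning to .text stores a plain string, so the markup characters
# are escaped when the tree is serialized again.
content_tag.text = 'Hello <b>world!</b>'
escaped = lxml.html.tostring(content_tag, encoding='unicode')
print(escaped)
```

The serialized output contains &lt;b instead of a real <b> element, which is the problem the question is about.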

1
  • The nice way, since you are really trying to modify the DOM structure, would be to add a new child node for world. Commented Aug 11, 2011 at 18:43

3 Answers

8

This is one way:

#!/usr/bin/env python3
from lxml.html import fromstring, tostring
from lxml.html import builder as E
fragment = """\
<div id="outer">
  <div id="inner">This is div.</div>
</div>"""

div = fromstring(fragment)
print(tostring(div, encoding='unicode'))
# <div id="outer">
#   <div id="inner">This is div.</div>
# </div>
# swap the inner div for a new element built with the E factory
div.replace(div.get_element_by_id('inner'), E.DIV('Hello ', E.B('world!')))
print(tostring(div, encoding='unicode'))
# <div id="outer">
#   <div>Hello <b>world!</b></div></div>

See also: http://lxml.de/lxmlhtml.html#creating-html-with-the-e-factory

Edit: I should have confessed earlier that I'm not all that familiar with lxml. I looked at the docs and source briefly but didn't find a clean solution. Perhaps someone more familiar will stop by and set us both straight.

In the meantime, this seems to work, but is not well tested:

import lxml.html
content_tag = lxml.html.fromstring('<div>Goodbye.</div>')
content_tag.text = ''  # assumes the element holds only text to start with
for elem in lxml.html.fragments_fromstring('Hello <b>world!</b>'):
    if isinstance(elem, str):  # only the first fragment can be a plain string
        content_tag.text += elem
    else:
        content_tag.append(elem)
print(lxml.html.tostring(content_tag, encoding='unicode'))

Edit again: this version removes the existing text and children first:

somehtml = 'Hello <b>world!</b>'
# purge element contents
content_tag.text = ''
for child in list(content_tag):  # copy the child list, since we mutate it
    content_tag.remove(child)

fragments = lxml.html.fragments_fromstring(somehtml)
# leading text, if any, comes back as a plain string in first position
if fragments and isinstance(fragments[0], str):
    content_tag.text = fragments.pop(0)
content_tag.extend(fragments)

1 Comment

That way doesn't work for me for 2 reasons: (1) I don't want to replace a tag, I want to replace the content of a tag and (2) The html segment that I want to inject is already in text form, I don't want to be building it with E.
0

Assuming content_tag doesn't have any subelements, you can just do:

from lxml import html
from lxml.html.builder import B

...

content_tag.text = 'Hello '
content_tag.append(B('world!'))
print(html.tostring(content_tag, encoding='unicode'))

3 Comments

Doesn't help. My HTML text is not known beforehand, so I can't construct it as an HTML structure in the code.
Ahh, but you didn't specify that in your question (the "is not known beforehand" part).
mwalsh's edited answer looks good and should work for arbitrary HTML.
0

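After tinkering around, I found this solution:

import lxml.etree
import lxml.html

# new_html is a placeholder name for the string with tags to inject
fragments = lxml.html.fragments_fromstring(new_html)
last = None

for frag in fragments:
    if isinstance(frag, lxml.etree._Element):
        content_tag.append(frag)
        last = frag
    else:
        # compare with None: an lxml element without children is falsy
        if last is not None:
            last.tail = frag
        else:
            content_tag.text = frag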
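
A note on where the text ends up, shown as a small sketch with a made-up input string: as far as I can tell from the lxml documentation, only the first item returned by fragments_fromstring can be a plain string, and any text that follows an element is already stored on that element's .tail.

```python
import lxml.html

# Made-up input with text before, between, and after elements.
fragments = lxml.html.fragments_fromstring('Hello <b>world</b>, and <i>more</i>!')

leading = fragments[0]        # leading text comes back as a plain string
bold, italic = fragments[1:]  # everything after it is an element

print(repr(leading))
print(bold.tag, repr(bold.tail))
print(italic.tag, repr(italic.tail))
```

So appending the elements carries the in-between text along with them, and only the leading string needs to be assigned to content_tag.text by hand.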
