0

How can I extract the content (how are you) from the string:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>. 

Can I use regex for the purpose? if possible whats suitable regex for it.

Note: I dont want to use split function for extract the result. Also can you suggest some links to learn regex for a beginner.

I am using python2.7.2

1
  • 1
    Can the string contain any XML escapes such as '&amp;' or even a CDATA section? If so then you should extract the XML-like bit from the start of the string and use an XML parser. Commented Jan 27, 2012 at 10:49

4 Answers 4

2

You could use a regular expression for this (as Joey demonstrates).

However if your XML document is any bigger than this one-liner you could not since XML is not a regular language.

Use BeautifulSoup (or another XML parser) instead:

>>> from BeautifulSoup import BeautifulSoup
>>> xml_as_str = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>. '
>>> soup = BeautifulSoup(xml_as_str)
>>> print soup.text
how are you.

Or...

>>> for string_tag in soup.findAll('string'):
...     print string_tag.text
... 
how are you
Sign up to request clarification or add additional context in comments.

2 Comments

Forgive me to correct you here, but a single XML element is definitely regular. XML only becomes non-regular if you introduce element nesting (but the name of the string element doesn't really imply arbitrary nesting, so this might be perfectly feasible). Also, isn't BeautifulSoup for parsing malformed HTML? Better to use an actual XML parser, I guess.
Thanks for the feedback: I've updated my wording. This string is not well-formed XML, either (the full-stop at the end).
0

Try with following regex:

/<[^>]*>(.*?)</

Comments

0
(?<=<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">)[^<]+(?=</string>)

would match what you want, as a trivial example.

(?<=<)[^<]+

would, too. It all depends a bit on how your input is formatted exactly.

Comments

0

This will match a generic HTML tag (Replace "string" with the tag you want to match):

/<string[^<]*>(.*?)<\/string>/i

(i=case insensitive)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.