Python Regex Google App Engine

Question

I'm using Python on GAE.

I'm trying to extract the following from HTML:

<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>

I want to get everything that has a "V" followed by 7 or more digits and has </FONT> after it.

My regex is

response = urllib2.urlopen(url)
html = response.read()
tree = etree.HTML(html)
mls = tree.xpath('/[V]\d{7,10}</FONT>')
self.response.out.write(mls)

It throws an "invalid expression" error. I don't know which part is invalid, because the pattern works in an online regex tester.

How can I do this in XPath format?

Suku · Accepted Answer · 2014-07-01 04:01:33Z

2

>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> a = re.search(r'(.*)(V[0-9]{7,})', s)
>>> a.group(2)
'V1068078'

EDIT

(.*) is greedy and is not needed here. re.search(r'V[0-9]{7,}', s) does the same extraction without it.

EDIT: As @Kaneg said, you can use findall to get all instances. It returns a list of every match of 'V[0-9]{7,}'.
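
To make the difference concrete, here is a small sketch comparing the three calls on an input with two IDs. The input string is made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
import re

# Hypothetical input with two IDs, to show which one each call returns.
s = '<TD>V1111111</TD><TD>V2222222</TD>'

# The greedy (.*) swallows as much as it can, so group(2) is the LAST ID.
greedy = re.search(r'(.*)(V[0-9]{7,})', s).group(2)

# Without the leading (.*), search() stops at the FIRST ID.
first = re.search(r'V[0-9]{7,}', s).group(0)

# findall() returns every non-overlapping match, in order.
all_ids = re.findall(r'V[0-9]{7,}', s)

print(greedy)   # V2222222
print(first)    # V1111111
print(all_ids)  # ['V1111111', 'V2222222']
```

So the leading (.*) is not just redundant: with more than one ID in the input it changes which one you get back.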

edited Jul 1, 2014 at 4:01

answered Jul 1, 2014 at 3:25

Suku


3 Comments

user3211229 Over a year ago

Thanks, Suku. If I want to use XPath, what should I do in that case?

Martin Konecny Over a year ago

What's the point of doing a greedy search (.*) at the beginning of the search?

Suku Over a year ago

@MartinKonecny, yes, you're right. We don't need it here. I've edited my answer.

hwnd · Accepted Answer · 2014-07-01 04:57:05Z

2

How can I do this in the XPath?

You can use starts-with() here.

>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath("//TD/FONT[starts-with(text(),'V')]")[0].text
>>> mls
'V1068078'

Or you can use a regular expression

>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath(r"//TD/FONT[re:match(text(), 'V\d{7,}')]",
...                  namespaces={'re': 'http://exslt.org/regular-expressions'})[0].text
>>> mls
'V1068078'

edited Jul 1, 2014 at 4:57

answered Jul 1, 2014 at 4:15

hwnd


Kaneg · Accepted Answer · 2014-07-01 03:36:43Z

1

The example below matches multiple occurrences:

import re
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V10683333</FONT></TD>,' \
' <TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068333333</FONT></TD>'
m = re.findall(r'V\d{7,}', s)
print m

answered Jul 1, 2014 at 3:36

Kaneg


Martin Konecny · Accepted Answer · 2014-07-01 03:45:12Z

1

The following will work:

result = re.search(r'V\d{7,}',s)
print result.group(0)  # prints 'V1068078'

It matches the letter V followed by a run of 7 or more digits.

EDIT

If you want it to find all instances, replace search with findall:

s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>V1068078   V1068078   V1068078'
re.findall(r'V\d{7,}', s)
# returns ['V1068078', 'V1068078', 'V1068078', 'V1068078']
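
The pattern in the question capped the run at 10 digits ({7,10}) rather than leaving it open. If that upper bound matters, note that {7,10} alone will happily match the first 10 digits of a longer run. A minimal sketch of one way to guard against that, assuming a negative lookahead is acceptable (the 12-digit token is made up for illustration; runs on Python 2.7 and 3):

```python
import re

# Hypothetical input: a valid ID followed by an over-long, 12-digit one.
s = 'V1068078 V123456789012'

# Open-ended: accepts any run of 7 or more digits.
unbounded = re.findall(r'V\d{7,}', s)

# Capped at 10, but silently truncates the longer run.
truncated = re.findall(r'V\d{7,10}', s)

# Capped at 10 AND rejects the match if another digit follows.
strict = re.findall(r'V\d{7,10}(?!\d)', s)

print(unbounded)  # ['V1068078', 'V123456789012']
print(truncated)  # ['V1068078', 'V1234567890']
print(strict)     # ['V1068078']
```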

edited Jul 1, 2014 at 3:45

answered Jul 1, 2014 at 3:28

Martin Konecny


1 Comment

Martin Konecny Over a year ago

Yes, updated my answer if you want to find more than 1.

Community · Accepted Answer · 2017-05-23 12:29:26Z

For everyone that keeps posting purely regex solutions, you need to read the question -- the problem is not just formulating a regular expression; it is an issue of isolating the right nodes of the XML/HTML document tree, upon which regex can be employed to subsequently isolate the desired strings.

You didn't show any of your import statements -- are you trying to use ElementTree? In order to use ElementTree you need to have some understanding of the structure of your XML/HTML, from the root down to the target tag (in your case, "TD/FONT"). Next you would use the ElementTree methods, "find" and "findall" to traverse the tree and get to your desired tags/attributes.

As has been noted previously, "ElementTree uses its own path syntax, which is more or less a subset of xpath. If you want an ElementTree compatible library with full xpath support, try lxml." ElementTree does have support for xpath, but not the way you are using it here.

If you indeed do want to use ElementTree, you should provide an example of the html you are trying to parse so everybody has a notion of the structure. In the absence of such an example, a made up example would look like the following:

import urllib2
from xml.etree import ElementTree

url = "http://www.uniprot.org/uniprot/P04637.xml"
response = urllib2.urlopen(url)
html = response.read()
# parse the downloaded document; fromstring() returns an Element
tree = ElementTree.fromstring(html)
# namespace prefix, see https://stackoverflow.com/questions/1249876/alter-namespace-prefixing-with-elementtree-in-python
ns = '{http://uniprot.org/uniprot}'
# first <uniprot> element (iter() includes the element it is called on)
root = next(tree.iter(ns+'uniprot'))
# walk down the known structure to the <taxon> elements
taxa = root.find(ns+'entry').find(ns+'organism').find(ns+'lineage').findall(ns+'taxon')
for taxon in taxa:
    print taxon.text

# Output:
Eukaryota
Metazoa
Chordata
Craniata
Vertebrata
Euteleostomi
Mammalia
Eutheria
Euarchontoglires
Primates
Haplorrhini
Catarrhini
Hominidae
Homo
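
Applied to the markup in the question, the same two-step idea (isolate the nodes first, then apply the regex to their text) might look like the sketch below. It uses only the standard library, so it assumes the input is well-formed XML; real-world HTML often is not, in which case lxml's HTML parser is the better fit. The surrounding TABLE/TR wrapper and the second cell are made up for illustration. Runs on Python 2.7 and 3.

```python
import re
from xml.etree import ElementTree

# The cell from the question plus a made-up cell that starts with "V"
# but is not an ID, wrapped in a made-up TABLE/TR so there is one root.
html = ('<TABLE><TR>'
        '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
        '<TD><FONT FACE="Arial,helvetica" SIZE="-2">Vacant</FONT></TD>'
        '</TR></TABLE>')

root = ElementTree.fromstring(html)

# "V", then 7 or more digits, then end of text.
pattern = re.compile(r'V\d{7,}$')

# Step 1: ElementTree's path syntax selects the TD/FONT nodes.
# Step 2: the regex filters on each node's text.
mls = [font.text for font in root.findall('.//TD/FONT')
       if font.text and pattern.match(font.text)]

print(mls)  # ['V1068078']
```

ElementTree's path subset has no starts-with() or regex functions, which is why the filtering happens in Python rather than inside the path expression.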

Avinash Raj · Accepted Answer · 2014-07-01 04:04:39Z

0

And here is one without capturing groups:

>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> m = re.search(r'(?<=>)V\d{7,}', s)
>>> print m.group(0)
V1068078
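
The question also asked that the ID be followed by </FONT>. As a sketch, a lookahead can express that in the same style as the lookbehind above, again without capturing groups. The second cell in the input is made up to show a non-match; the code runs on Python 2.7 and 3.

```python
import re

# The cell from the question, plus a made-up cell whose ID is NOT
# wrapped in FONT and should therefore be skipped.
s = ('<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
     '<TD>V7654321</TD>')

# (?<=>)      : preceded by ">"
# V\d{7,}     : "V" and 7 or more digits
# (?=</FONT>) : followed by "</FONT>" (checked, but not part of the match)
ids = re.findall(r'(?<=>)V\d{7,}(?=</FONT>)', s)

print(ids)  # ['V1068078']
```

Note that the tag comparison is case-sensitive as written, so this only matches an upper-case </FONT> like the one in the question.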

answered Jul 1, 2014 at 4:04

Avinash Raj
