Python regular expression in html

Question

I am trying to get the Size value from an HTML page.

The HTML is:

<tr>
<td style="padding-left: 5px;" class="subheader" 
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2 
file(s)</td>
</tr>

I tried this:

size = re.search (r"""<tr>
<td style="padding-left: 5px;" class="subheader" 
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">.+ in \d
file(s)</td>
</tr>""", Text)

But I get None. I only need it to give the 1.64 GB part. What is wrong with it?

If you must use a regex, why do you need the regex to cover the whole string and not just re.search(">(.+) in \d", Text).group(1)?
– Nick is tired, Commented Apr 13, 2018 at 16:18

Sumit Jha · Accepted Answer · 2018-04-13 16:39:40Z

1

BeautifulSoup is a better option for HTML parsing. However, if you want to use a regular expression, here is what you can do:

import re
regex = r"<td.*>\s*(\d+[.]\d+\s+\w+).*<\/td>"
test_str = ("<tr> \n"
    "<td style=\"padding-left: 5px;\" class=\"subheader\"  \n"
    "valign=\"top\" width=\"147\" align=\"right\">Size</td> \n"
    "<td valign=\"top\" style=\"padding-left: 5px;\">1.64 GB in 2  \n"
    "file(s)</td> \n"
    "</tr>")

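# re.DOTALL lets "." match newlines, since the markup spans several lines.
matches = re.search(regex, test_str, re.DOTALL)
# re.search returns None when nothing matches, so check before calling group().
if matches:
    print(matches.group(1))
else:
    print("no match")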

Output

1.64 GB

edited Apr 13, 2018 at 16:39

answered Apr 13, 2018 at 16:10


1 Comment

maurizio de ruggiero Over a year ago

Thanks a lot. This works, but I think the regex needs a little tweak because I only get part of the size, i.e. I get 1.20 MB while the real size is 151.20 MB.
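
One plausible cause of a truncated number like this is a greedy .* sitting directly before the capture group: it consumes as much as it can, including the leading digits, and gives back only the minimum the group needs. It is not clear which version of the pattern produced 1.20 MB, so the first pattern below is a hypothetical illustration, not the answer's code. The second is a sketch of a tighter pattern that uses [^>]* to stay inside the opening tag and also accepts sizes without a decimal part. The sample string is made up from the numbers in the comment.

```python
import re

# Made-up sample cell using the size mentioned in the comment above.
html = ('<td valign="top" style="padding-left: 5px;">151.20 MB in 2 \n'
        'file(s)</td>')

# Hypothetical pattern: the greedy ".*" sits directly before the group,
# so it swallows "15" and leaves only "1.20 MB" for the group.
greedy = re.search(r"<td.*(\d+[.]\d+\s+\w+)", html, re.DOTALL)

# Tighter sketch: "[^>]*" cannot run past the end of the opening tag,
# so the group starts at the first digit of the cell text.
tight = re.search(r"<td[^>]*>\s*(\d+(?:[.]\d+)?\s+\w+)", html)

print(greedy.group(1))  # 1.20 MB
print(tight.group(1))   # 151.20 MB
```

On a full page the tighter pattern still returns the first cell that starts with a number and a unit, which may not be the Size row, so locating the row with a parser first is the safer route.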

Rakesh · Accepted Answer · 2018-04-13 16:07:21Z

1

It is a better idea to parse HTML using an HTML parser.

Example using BeautifulSoup:

import re
from bs4 import BeautifulSoup

s = """<tr>
<td style="padding-left: 5px;" class="subheader" 
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2 
file(s)</td>
</tr>"""
soup = BeautifulSoup(s, "html.parser")
# Text of the <td> that follows the "Size" label cell.
size_text = soup.tr.td.find_next('td').text
print(size_text)
# Use a regex to keep only the required data (note the escaped dot).
print(re.findall(r"\d+\.\d+ [A-Z]+", size_text.strip()))

Output:

1.64 GB in 2 
file(s)
['1.64 GB']

edited Apr 13, 2018 at 16:07

answered Apr 13, 2018 at 16:01


3 Comments

maurizio de ruggiero Over a year ago

Thanks for the replies. But I only need "1.64 GB". Also, that HTML is just an extract of a whole web page. I thought using re would be easier, and the size number always changes.

ryanmrubin Over a year ago

You can parse the string with a regex if that's useful once you get it out of the html--you'll have a much easier time with that than trying to get it out with the regex initially. Not knowing your whole HTML document or what the other td elements like this look like, I wouldn't know how to guide you in constructing the exact regex or the exact way to use beautifulsoup.

Rakesh Over a year ago

Updated the snippet. Parsing the HTML with a parser is a lot easier than just using regex. You can apply a regex to the text after getting the value out of Beautiful Soup.

ryanmrubin · Accepted Answer · 2018-04-13 16:07:24Z

In general, I would avoid using regexes to parse HTML. It is likely easier for you to use beautifulsoup or a similar library. Using beautifulsoup in Python:

In [1]: from bs4 import BeautifulSoup

In [2]: soup = BeautifulSoup(html, 'html.parser')

In [3]: soup
Out[3]: 
<tr>
<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>
<td style="padding-left: 5px;" valign="top">1.64 GB in 2 
file(s)</td>
</tr>

In [4]: soup.tr
Out[4]: 
<tr>
<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>
<td style="padding-left: 5px;" valign="top">1.64 GB in 2 
file(s)</td>
</tr>

In [5]: soup.tr.find_all('td')
Out[5]: 
[<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>,
 <td style="padding-left: 5px;" valign="top">1.64 GB in 2 
 file(s)</td>]

In [6]: soup.tr.find_all('td')[1]
Out[6]: 
<td style="padding-left: 5px;" valign="top">1.64 GB in 2 
file(s)</td>

In [7]: soup.tr.find_all('td')[1].text
Out[7]: '1.64 GB in 2 \nfile(s)'

If you need a more targeted way of searching the HTML, beautifulsoup provides a number of those.
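
For example, one targeted approach is to find the label cell by its text and then take the cell next to it. This is only a sketch: it assumes the label cell contains exactly "Size" and that the value cell is its next td sibling in the same row, which may not hold for the full page. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<tr>
<td style="padding-left: 5px;" class="subheader"
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2
file(s)</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")

# Find the <td> whose text is exactly "Size" (assumed label),
# then move to the next <td> sibling, which is assumed to hold the value.
label = soup.find("td", string="Size")
value = label.find_next_sibling("td") if label else None

# Collapse runs of whitespace, including the newline inside the cell.
text = " ".join(value.get_text().split()) if value else None
print(text)  # 1.64 GB in 2 file(s)
```

Searching by label text rather than by position means the lookup does not depend on where the row sits in the page.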

Once you have the text in question, you can parse that with a regex, or string methods, or however else you'd like to. Not knowing your whole HTML document or what the other td elements like this look like, I wouldn't know how to guide you in constructing the exact regex or the exact way to use beautifulsoup. But this should get you close.
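
As a rough sketch of that last step, here are both options applied to the string from Out[7]. It assumes the text always starts with a number followed by a unit; adjust the pattern if the real page differs.

```python
import re

text = '1.64 GB in 2 \nfile(s)'  # the value shown in Out[7]

# Regex: a number with optional decimals, whitespace, then a unit.
m = re.search(r"(\d+(?:\.\d+)?)\s+([A-Za-z]+)", text)
size, unit = (float(m.group(1)), m.group(2)) if m else (None, None)
print(size, unit)  # 1.64 GB

# String methods: split on whitespace and keep the first two tokens.
parts = text.split()
print(" ".join(parts[:2]))  # 1.64 GB
```

The regex version gives the number and the unit separately, which helps if you want to compare or convert sizes later.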
