I am trying to extract the Size value from an HTML page.

The HTML is:

<tr>
<td style="padding-left: 5px;" class="subheader" 
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2 
file(s)</td>
</tr>

I tried this

size = re.search (r"""<tr>
<td style="padding-left: 5px;" class="subheader" 
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">.+ in \d
file(s)</td>
</tr>""", Text) 

But I get None. I only need it to give the "1.64 GB" part. What is wrong with it?

  • Why not use an HTML parser like BeautifulSoup? Commented Apr 13, 2018 at 15:55
  • If you must use a regex, why does it need to cover the whole string rather than just re.search(r">(.+) in \d", Text).group(1)? Commented Apr 13, 2018 at 16:18
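
One likely reason the pattern in the question returns None: unescaped parentheses in a regex form a group, so `file(s)` matches the text `files` rather than the literal `file(s)`. Any whitespace difference between the pattern and the page, such as a trailing space before a line break, also breaks a literal multi-line match. A minimal sketch of both points, using a shortened sample string (an assumption, not the full page) and a capture group in the spirit of the comment above:

```python
import re

# Shortened stand-in for the second cell of the row (assumed sample text).
text = '<td valign="top" style="padding-left: 5px;">1.64 GB in 2 \nfile(s)</td>'

# Unescaped parentheses form a group, so this pattern looks for "files".
print(re.search(r"file(s)", text))  # None

# Escaped parentheses match the literal characters "(" and ")".
print(re.search(r"file\(s\)", text) is not None)  # True

# Capture only the size; [^<]+? keeps the match inside a single cell.
match = re.search(r">([^<]+?) in \d+", text)
size = match.group(1) if match else None
print(size)  # 1.64 GB
```

Checking `match` before calling `.group(1)` avoids an AttributeError when the page does not contain the expected text.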

3 Answers

BeautifulSoup is a better option for HTML parsing. However, if you want to use a regular expression, here is what you can do.

import re
regex = r"<td[^>]*>\s*(\d+(?:\.\d+)?\s+\w+)[^<]*</td>"  # [^>]* and [^<]* keep the match inside one cell
test_str = ("<tr> \n"
    "<td style=\"padding-left: 5px;\" class=\"subheader\"  \n"
    "valign=\"top\" width=\"147\" align=\"right\">Size</td> \n"
    "<td valign=\"top\" style=\"padding-left: 5px;\">1.64 GB in 2  \n"
    "file(s)</td> \n"
    "</tr>")

matches = re.search(regex, test_str, re.DOTALL)
if matches:
    print(matches.group(1))
else:
    print("No size found")

Output

1.64 GB
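
On a full page, a pattern built around `.*` can match somewhere other than the Size row. One way to make the regex approach more robust is to anchor on the "Size" label cell and capture the number and unit from the cell that follows it. The sketch below assumes the page keeps the two-cell layout shown in the question; the extra "Seeds" row and the name `size_pattern` are hypothetical and only there for illustration.

```python
import re

# Assumed sample: the row from the question preceded by an unrelated row.
html = (
    '<tr>\n<td class="subheader">Seeds</td>\n<td>12.5 K in total</td>\n</tr>\n'
    '<tr>\n<td style="padding-left: 5px;" class="subheader" \n'
    'valign="top" width="147" align="right">Size</td>\n'
    '<td valign="top" style="padding-left: 5px;">151.20 MB in 2 \n'
    'file(s)</td>\n</tr>'
)

# Anchor on the "Size" label cell, then capture number + unit from the next cell.
# [^>]* keeps the tag match inside a single tag instead of spanning the page.
size_pattern = re.compile(r">Size</td>\s*<td[^>]*>\s*(\d+(?:\.\d+)?\s*[A-Za-z]+)")

match = size_pattern.search(html)
size = match.group(1) if match else None
print(size)  # 151.20 MB
```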

1 Comment

Thanks a lot. This works, but I think the regex needs a little tweak because I only get part of the size, i.e. I get 1.20 MB while the real size is 151.20 MB.

It is a better idea to parse HTML using an HTML parser.

Example using BeautifulSoup:

import re
from bs4 import BeautifulSoup
s = """<tr>
<td style="padding-left: 5px;" class="subheader" 
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2 
file(s)</td>
</tr>"""
soup = BeautifulSoup(s, "html.parser")
cell_text = soup.tr.td.find_next('td').text  # text of the cell after the "Size" label
print(cell_text)
print(re.findall(r"\d+\.\d+ [A-Z]+", cell_text.strip()))   # Use a regex to get only the required data.

Output:

1.64 GB in 2 
file(s)
['1.64 GB']

3 Comments

Thanks for the replies, but I only need "1.64 GB". Also, that HTML is just an extract of a whole web page. I thought using re would be easier, and the size number always changes.
You can parse the string with a regex if that's useful once you get it out of the html--you'll have a much easier time with that than trying to get it out with the regex initially. Not knowing your whole HTML document or what the other td elements like this look like, I wouldn't know how to guide you in constructing the exact regex or the exact way to use beautifulsoup.
Updated snippet. Parsing the HTML with a parser is a lot easier than just using a regex. You can apply a regex to the text after getting the value out of Beautiful Soup.

In general, I would avoid using regexes to parse HTML. It is likely easier for you to use BeautifulSoup, or some other similar library. Using BeautifulSoup in Python:

In [1]: from bs4 import BeautifulSoup

In [2]: soup = BeautifulSoup(html, 'html.parser')

In [3]: soup
Out[3]: 
<tr>
<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>
<td style="padding-left: 5px;" valign="top">1.64 GB in 2 
file(s)</td>
</tr>

In [4]: soup.tr
Out[4]: 
<tr>
<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>
<td style="padding-left: 5px;" valign="top">1.64 GB in 2 
file(s)</td>
</tr>

In [5]: soup.tr.find_all('td')
Out[5]: 
[<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>,
 <td style="padding-left: 5px;" valign="top">1.64 GB in 2 
 file(s)</td>]

In [6]: soup.tr.find_all('td')[1]
Out[6]: 
<td style="padding-left: 5px;" valign="top">1.64 GB in 2 
file(s)</td>

In [7]: soup.tr.find_all('td')[1].text
Out[7]: '1.64 GB in 2 \nfile(s)'

If you need a more targeted way of searching the HTML, beautifulsoup provides a number of those.
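
As one possible example of a more targeted search, the label cell can be located by its text and the value read from the cell next to it. This is a sketch that assumes the label cell contains exactly the text "Size" and that the value sits in the next `td` sibling; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The row from the question; a real page would contain many more rows.
html = """<tr>
<td style="padding-left: 5px;" class="subheader"
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2
file(s)</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")

# Find the label cell by its exact text, then step to the cell that follows it.
label = soup.find("td", string="Size")
value_cell = label.find_next_sibling("td") if label else None

# Collapse the line break and repeated spaces inside the cell text.
text = " ".join(value_cell.get_text().split()) if value_cell else None
print(text)  # 1.64 GB in 2 file(s)
```

Searching by the label text rather than by position means the lookup should keep working if other rows are added before the Size row.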

Once you have the text in question, you can parse that with a regex, or string methods, or however else you'd like to. Not knowing your whole HTML document or what the other td elements like this look like, I wouldn't know how to guide you in constructing the exact regex or the exact way to use beautifulsoup. But this should get you close.
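
For instance, with the text from the session above, plain string methods are enough to isolate the size. This is only a sketch and assumes the text always has the form "<number> <unit> in <count> file(s)":

```python
# Text as returned by soup.tr.find_all('td')[1].text in the session above.
cell_text = '1.64 GB in 2 \nfile(s)'

# Everything before " in " is the size.
size_part = cell_text.partition(" in ")[0]

# split() with no arguments separates the number from the unit.
value, unit = size_part.split()
print(value, unit)  # 1.64 GB
```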
