I am trying to parse an HTML page using regular expressions. I have to find the sum of all the comments on the web page https://py4e-data.dr-chuck.net/comments_42.html and everything else works fine, but re.findall is only picking up the second digit of each two-digit number. I cannot figure out why this is happening.
This is my code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
code = list()
html = urllib.request.urlopen("https://py4e-data.dr-chuck.net/comments_42.html", context=ctx)
for line in html:
    line = line.decode()
    line = line.strip()
    numbers = re.findall("<span.+([0-9]+)", line)
    if len(numbers) != 1: continue
    print(numbers)
This is my output: I am getting 7 instead of 97, and 0 instead of 90.
span tag is split across lines), and probably misparse all sorts of other things (e.g. matching digits that aren't part of the span you think you're matching against). Parsing HTML with a regex is evil, and while you may think it's simpler and faster to get something working, it will bite you eventually, because regexes are bad for this purpose.
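To make both points concrete, here is a minimal sketch. It assumes rows shaped like `<span class="comments">97</span>`; the SAMPLE text, the names in it, and the SpanNumbers class are hypothetical, made up for illustration. It uses the standard library's html.parser so it runs without extra installs; BeautifulSoup, which the question already imports, would serve the same purpose. The first part shows why only the last digit is captured: `.+` is greedy, so it swallows as much of the line as it can and backtracks just far enough to leave a single digit for `([0-9]+)`. The second part shows a per-line regex missing a tag split across lines, while a real parser still finds it.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample rows (not copied from the real page).
# The third <span> tag is deliberately split across two lines.
SAMPLE = """<tr><td>Alice</td><td><span class="comments">97</span></td></tr>
<tr><td>Bob</td><td><span class="comments">90</span></td></tr>
<tr><td>Carol</td><td><span
 class="comments">8</span></td></tr>"""

first_line = SAMPLE.splitlines()[0]

# Greedy ".+" consumes the first digit, leaving only the last one for the group.
greedy = re.findall(r"<span.+([0-9]+)", first_line)
print(greedy)    # ['7']

# Stopping at the end of the opening tag keeps the whole number.
anchored = re.findall(r"<span[^>]*>([0-9]+)", first_line)
print(anchored)  # ['97']

# Even the corrected regex, applied line by line, misses the split tag.
per_line = [
    match
    for line in SAMPLE.splitlines()
    for match in re.findall(r"<span[^>]*>([0-9]+)", line)
]
print(per_line)  # ['97', '90']


class SpanNumbers(HTMLParser):
    """Collect integer text found inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_span and text.isdigit():
            self.values.append(int(text))


parser = SpanNumbers()
parser.feed(SAMPLE)   # feed the whole document, not one line at a time
parser.close()
total = sum(parser.values)
print(parser.values, total)  # [97, 90, 8] 195
```

Against the real page, the decoded response body would be fed to the parser in place of SAMPLE. This sketch does not check the class attribute, so a page with other numeric span elements would need an extra filter in handle_starttag.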