re.findall picking up only second digit of a two digit number in a web page [duplicate]

Question

I am trying to parse an HTML page using regular expressions. I have to find the sum of all comments from this web page: https://py4e-data.dr-chuck.net/comments_42.html Everything else is working fine, but the re.findall function is only picking up the second digit of each two-digit number. I am not able to figure out why this is happening.

This is my code:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
code = list()
html = urllib.request.urlopen("https://py4e-data.dr-chuck.net/comments_42.html", context=ctx)
for line in html:
    line = line.decode()
    line = line.strip()
    numbers = re.findall("<span.+([0-9]+)", line)
    if len(numbers) != 1: continue
    print(numbers)

In my output, I am getting 7 instead of 97, and 0 instead of 90.

As a side-note: Why are you parsing HTML with a regex? You've imported BeautifulSoup. The regex you've written will fail to match all kinds of things it should match (e.g. if the span tag is split across lines), and will probably misparse all sorts of other things (e.g. matching digits that aren't part of the span you think you're matching against). Parsing HTML with a regex is evil, and while you may think it's simpler and faster to get something working, it will bite you eventually, because regexes are bad for this purpose.
– ShadowRanger, Commented Aug 11, 2023 at 15:57
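
To illustrate the point in the comment above, here is a minimal sketch that extracts the numbers with an HTML parser instead of a regex. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs. The SpanNumberParser class and the sample rows are made up for illustration and are not taken from the actual page.

```python
from html.parser import HTMLParser


class SpanNumberParser(HTMLParser):
    """Hypothetical helper: collect integers that appear as text inside <span> tags."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        text = data.strip()
        # Only keep text that sits inside a <span> and consists of decimal digits.
        if self.in_span and text.isdecimal():
            self.numbers.append(int(text))


# Made-up sample rows; the first span's content is split across lines on purpose.
sample = (
    '<tr><td>Alice</td><td><span class="comments">\n97\n</span></td></tr>\n'
    '<tr><td>Bob</td><td><span class="comments">90</span></td></tr>\n'
)

parser = SpanNumberParser()
parser.feed(sample)
parser.close()

print(parser.numbers)       # [97, 90]
print(sum(parser.numbers))  # 187
```

Because the parser tracks tags rather than lines, the span whose content is broken across lines is handled the same way as the one on a single line, which the line-by-line regex in the question would miss.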

ShadowRanger · Accepted Answer · 2023-08-11 15:58:57Z

Regexes are greedy by default (not just in Python, in basically every regex system I'm aware of), so each variable-length quantifier (e.g. * and +) tries to take as many characters as possible, from left to right, so long as the rest of the pattern can still match. As such, the .+ in <span.+([0-9]+) matches everything up to the last digit on the line, leaving only that one digit for [0-9]+, so the captured group is never more than one digit long.
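
A small sketch makes the greedy behavior visible. The line below is a made-up row shaped like the ones in the question, and the extra capture group around .+ is added only to show what the greedy part consumed.

```python
import re

# Hypothetical line, shaped like a row from the page in the question.
line = '<tr><td>Alice</td><td><span class="comments">97</span></td></tr>'

# Same pattern as in the question, with .+ wrapped in a group for inspection.
match = re.search(r"<span(.+)([0-9]+)", line)
greedy_part, digits = match.groups()

print(repr(greedy_part))  # ' class="comments">9'
print(digits)             # 7
```

The greedy .+ has swallowed the 9, so only the 7 is left for [0-9]+.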

You can solve this in various ways:

If the characters between span and the desired digits will never be digits themselves, match only non-digits instead of ., e.g. r"<span[^0-9]+([0-9]+)". The regex is still greedy, and should perform equally well, but it will stop at the first run of digits and capture all of them, rather than capturing only the final digit of the last run of digits. (Note: I used an r prefix to make that a raw string, which you should always do with Python regex literals to avoid issues with string escapes overlapping regex escapes; it would also let you safely use \D and \d in place of [^0-9] and [0-9] respectively, if you liked and weren't concerned with non-ASCII digits.)
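Alternatively, keep the . and make the .+ non-greedy by changing the regex to r"<span.+?([0-9]+)". The ? after the + means "match the fewest characters possible", rather than the default greedy "match as many as possible", so the .+? stops just before the first run of digits and [0-9]+ captures that whole run. Like the first option, this captures the first complete run of digits after <span, so it will not help if other digits can appear before the ones you want. It will typically make the regex run a little slower, but not enough to matter in most cases.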
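
As a rough comparison, the sketch below runs the original pattern and both fixes over two made-up lines shaped like the rows in the question. The variable names and the sample lines are assumptions for illustration only.

```python
import re

# Hypothetical lines, shaped like rows from the page in the question.
lines = [
    '<tr><td>Alice</td><td><span class="comments">97</span></td></tr>',
    '<tr><td>Bob</td><td><span class="comments">90</span></td></tr>',
]

patterns = {
    "original": r"<span.+([0-9]+)",        # greedy .+ leaves one digit
    "non_digit": r"<span[^0-9]+([0-9]+)",  # cannot consume digits early
    "lazy": r"<span.+?([0-9]+)",           # stops before the first digit run
}

results = {}
for name, pattern in patterns.items():
    # Collect every captured group across all lines for this pattern.
    results[name] = [num for line in lines for num in re.findall(pattern, line)]
    print(name, results[name])

# Sum the comment counts using one of the fixed patterns.
total = sum(int(num) for num in results["non_digit"])
print(total)  # 187
```

On these lines the original pattern yields ['7', '0'], while both fixes yield ['97', '90'].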
