0

I'm trying to extract integers from a URL with bs4. I imported re to pull out the numbers, but I get the above error. I'm confused and would appreciate some help.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    re.findall('<span.*[0-9].*',tag)

Link: http://py4e-data.dr-chuck.net/comments_314936.html
Expected output: print the numbers from the link.

6
  • Add some output and link for us to be able to help from there. Commented Nov 12, 2019 at 5:11
  • were going to type the same... Commented Nov 12, 2019 at 5:11
  • py4e-data.dr-chuck.net/comments_314936.html Commented Nov 12, 2019 at 5:16
  • That's the link. Commented Nov 12, 2019 at 5:17
  • It's an http link, so why are you using ssl? Commented Nov 12, 2019 at 5:26

2 Answers

2

You can get the numbers directly with .get_text(), without a regular expression. I have also removed the unnecessary code (the SSL context and the re import), since the link is plain http.

from urllib.request import urlopen
from bs4 import BeautifulSoup


url = 'http://py4e-data.dr-chuck.net/comments_314936.html'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    print(tag.get_text())

Output:

100
98
93
91
.
.
.
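
If the values are needed as integers rather than as text, the same .get_text() approach can be combined with int(). The following is only a minimal sketch: SAMPLE_HTML is hypothetical markup modeled on a page that wraps each number in a span, and span_numbers is an illustrative helper name, not part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the real page contents).
SAMPLE_HTML = """
<table>
<tr><td>Alice</td><td><span class="comments">100</span></td></tr>
<tr><td>Bob</td><td><span class="comments">98</span></td></tr>
<tr><td>Carol</td><td><span class="comments">93</span></td></tr>
</table>
"""


def span_numbers(html):
    """Return the integer held by each <span>, skipping non-numeric spans."""
    soup = BeautifulSoup(html, 'html.parser')
    numbers = []
    for tag in soup('span'):
        # get_text() returns the text inside the tag as a plain str
        text = tag.get_text().strip()
        if text.isdecimal():
            numbers.append(int(text))
    return numbers


print(span_numbers(SAMPLE_HTML))
```

Working on the text of each tag, rather than on its markup, keeps attribute values out of the result.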

1

Each 'tag' is a bs4.element.Tag object, not a string.
re.findall() only searches strings, so convert the tag with str() before searching it.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the span tags
tags = soup('span')

for tag in tags:
    # str(tag) gives the tag's markup as a string; the raw string avoids an invalid-escape warning
    word = re.findall(r'\d+', str(tag))
    word = ''.join(word)
    print(word)
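
One caveat with searching str(tag): the string is the tag's full markup, so digits inside attributes are matched as well. The sketch below uses only the re module and a hypothetical markup string (the id attribute is an assumption added for illustration) to show the difference between matching anywhere in the markup and matching only the text between the span tags. The helper names are illustrative.

```python
import re

# Hypothetical markup: the id attribute happens to contain a digit.
MARKUP = '<span class="comments" id="c7">100</span>'


def digits_in_markup(markup):
    """Match runs of digits anywhere in the markup, attributes included."""
    return re.findall(r'\d+', markup)


def digits_in_text(markup):
    """Match only digits that form the whole text content of a span."""
    return re.findall(r'<span[^>]*>\s*(\d+)\s*</span>', markup)


print(''.join(digits_in_markup(MARKUP)))  # attribute digit is joined in: 7100
print(''.join(digits_in_text(MARKUP)))    # text content only: 100
```

For the page in the question this may make no difference if the span attributes contain no digits, but restricting the match to the text is the safer habit.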
