I scraped a webpage with BeautifulSoup and assigned the result to 'soup'. I can get the text 'Aberdeen' by appending .text to 'site_url' (see the code below).

What I really want is the complete URL as a string, e.g. "http://www.somewebsite.com/networks/site-info?site_id=ABD"

>>> site_link = soup.find_all('a', string='Aberdeen')[0]
>>> site_row = site_link.find_parent('td').find_parent('tr')
>>> site_column = site_row.find_all('td')
>>> site_url = site_column[0].contents[0]
>>> print(site_url)

<a href="../networks/site-info?site_id=ABD">Aberdeen</a>

I have not had any luck so far and do not know what else to try. How can I get the full URL?

  • Take a look at THIS. Hope this helps! Commented Aug 15, 2017 at 10:15
  • The page I am trying to scrape is uk-air.defra.gov.uk/latest/currentlevels and I am interested in the urls corresponding to the site names in the first columns of the table e.g. uk-air.defra.gov.uk/networks/site-info?site_id=ACTH for the first name which is Auchencorth Moss Commented Aug 15, 2017 at 10:15
  • @N.Ivanov I have tried something similar but the problem is that there are many different types of links on the page, I just want the said links Commented Aug 15, 2017 at 10:17

2 Answers

You can use a regular expression to select the links, then use urljoin to turn each relative href into an absolute URL.

import re

import requests
from bs4 import BeautifulSoup

try:
    from urlparse import urljoin  # Python 2
except ImportError:
    from urllib.parse import urljoin  # Python 3

url = 'https://uk-air.defra.gov.uk/latest/currentlevels'
# Send a non-empty User-Agent header with the request.
r = requests.get(url, headers={'User-Agent': 'Not blank'})
soup = BeautifulSoup(r.text, 'html.parser')

# soup(...) is shorthand for soup.find_all(...): match every <a> whose href contains 'site_id'.
for elem in soup('a', href=re.compile(r'site_id')):
    print(elem.text)
    # Resolve the relative href against the page URL.
    print(urljoin(url, elem['href']))

Outputs:

Auchencorth Moss
https://uk-air.defra.gov.uk/networks/site-info?site_id=ACTH
Bush Estate
https://uk-air.defra.gov.uk/networks/site-info?site_id=BUSH
Dumbarton Roadside
https://uk-air.defra.gov.uk/networks/site-info?site_id=DUMB
Edinburgh St Leonards
https://uk-air.defra.gov.uk/networks/site-info?site_id=ED3
Glasgow Great Western Road
https://uk-air.defra.gov.uk/networks/site-info?site_id=GGWR
Glasgow High Street
https://uk-air.defra.gov.uk/networks/site-info?site_id=GHSR
...

If you only want Aberdeen, use:

for elem in soup('a', href=re.compile(r'site_id'), string='Aberdeen'):

instead of:

for elem in soup('a', href=re.compile(r'site_id')):

Outputs:

Aberdeen
https://uk-air.defra.gov.uk/networks/site-info?site_id=ABD
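
The urljoin step can be checked on its own, without fetching the page. The following is a minimal sketch using only the standard library (Python 3); the href is the one printed in the question, and the names page_url and relative_href are just illustrative.

```python
from urllib.parse import urljoin

# URL of the page the link was scraped from.
page_url = "https://uk-air.defra.gov.uk/latest/currentlevels"

# The href exactly as it appears in the <a> tag from the question.
relative_href = "../networks/site-info?site_id=ABD"

# urljoin resolves ".." against the directory of the page URL ("/latest/"),
# so the result points at "/networks/..." on the same host.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)
```

With a BeautifulSoup tag such as the question's site_url, the href would come from site_url['href'] and be passed to urljoin in the same way.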

Try this. I hope it will meet all your requirements:

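import requests
from lxml import html  # tree.cssselect() also needs the cssselect package installed

base_link = "https://uk-air.defra.gov.uk"
page_url = "https://uk-air.defra.gov.uk/latest/currentlevels"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}
page_source = requests.get(page_url, headers=headers).text
tree = html.fromstring(page_source)
# Select the links in the table cells, skipping anchors with class "smalltext".
for title in tree.cssselect("table.current_levels_table td a:not(.smalltext)"):
    # Drop the leading ".." from the href and prepend the host.
    print(base_link + title.attrib['href'][2:])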

Partial results:

https://uk-air.defra.gov.uk/networks/site-info?site_id=ACTH
https://uk-air.defra.gov.uk/networks/site-info?site_id=BUSH
https://uk-air.defra.gov.uk/networks/site-info?site_id=DUMB
https://uk-air.defra.gov.uk/networks/site-info?site_id=ED3
https://uk-air.defra.gov.uk/networks/site-info?site_id=GGWR
https://uk-air.defra.gov.uk/networks/site-info?site_id=GHSR
https://uk-air.defra.gov.uk/networks/site-info?site_id=GLA4
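
One thing to keep in mind with the [2:] slice is that it relies on every href starting with "..". The sketch below compares it with urljoin on two inputs; it is an illustration only. The helper names by_slicing and by_urljoin are made up here, and the root-relative href is a hypothetical case, not something taken from the page.

```python
from urllib.parse import urljoin

PAGE_URL = "https://uk-air.defra.gov.uk/latest/currentlevels"
BASE_LINK = "https://uk-air.defra.gov.uk"

def by_slicing(href):
    """Hypothetical helper: the slicing approach, which assumes a leading '..'."""
    return BASE_LINK + href[2:]

def by_urljoin(href):
    """Hypothetical helper: resolve the href against the page URL."""
    return urljoin(PAGE_URL, href)

# Form seen in the question: both approaches give the same URL.
dotted = "../networks/site-info?site_id=ACTH"
# Hypothetical root-relative form: slicing cuts off "/n", urljoin still works.
rooted = "/networks/site-info?site_id=ACTH"

for href in (dotted, rooted):
    print(by_slicing(href))
    print(by_urljoin(href))
```

If the hrefs on the page always have the "../" form, the slice is fine; urljoin is simply the more forgiving choice if that ever changes.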
