I am a bit new to webscraping and trying to build a scraper to collect the title, text, and date from this archived page:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.common.by import By
import pandas as pd
import csv
import sqlite3
import time
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

service = Service()
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)

url = 'https://webarchive.nla.gov.au/awa/20220427161017/https://www.dfat.gov.au/news/media/Pages/advancing-the-national-interest-call-for-public-submissions'
driver.get(url)


p = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "block-region-content")))
time.sleep(120)
soup = BeautifulSoup(driver.page_source, 'html.parser')

title = soup.find('h1', class_ = 'au-header-heading').text
print(title)

date = soup.find('time').text
print(date)

news = soup.find('div', class_ = 'paragraph-content')
print(news)

I keep getting the following error message: --> 105 raise TimeoutException(message, screen, stacktrace)

TimeoutException: Message:

I'm not sure how to interpret this as I've built in some waits and sleep timers to ensure that elements that I want to scrape on the page have actually loaded (or so I thought). I'm also including a link to the non-archived site: https://www.dfat.gov.au/news/media-release/memorandum-understanding-between-government-commonwealth-australia-and-government-state-california-united-states-america.

The code above works on this link perfectly (without the built in wait times), so there seems to be an issue with the fact that I'm trying to scrape from the archived site. Any suggestions would be very helpful.

  • When you debug, which operation is timing out and what debugging have you done for that operation? Commented Sep 30 at 13:51
  • It seems to be raising this line: 23 p = WebDriverWait(driver, 40).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "block-region-content"))) Commented Sep 30 at 13:54
  • I changed EC.visibility_of_all_elements_located to EC.visibility_of_element_located and it still raises a timeout error with a series of stack trace lines. Commented Sep 30 at 13:59
  • A timeout on that line seems to imply that the element(s) you're looking for aren't found in the markup. In your debugging, can you examine the markup being used by the code? Is any such matching element there? Commented Sep 30 at 14:03
  • The archived page loads inside an iframe, which needs to be waited for Commented Sep 30 at 14:08

1 Answer


The issue is that the target element is inside an IFRAME. In such cases, the driver needs to switch into the IFRAME before it can interact with the element.

Here is the working refactored code:

  1. Removed unused imports (including time, which is no longer needed)
  2. Removed the unnecessary time.sleep
  3. Used WebDriverWait effectively by creating a single wait object and calling it twice
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

service = Service()
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)

url = 'https://webarchive.nla.gov.au/awa/20220427161017/https://www.dfat.gov.au/news/media/Pages/advancing-the-national-interest-call-for-public-submissions'
driver.get(url)
driver.maximize_window()
wait = WebDriverWait(driver, 20)
# Below line is to switch to iframe if the content is within an iframe
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "replayFrame")))

p = wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "block-region-content")))
soup = BeautifulSoup(driver.page_source, 'html.parser')

title = soup.find('h1', class_ = 'au-header-heading').text
print(title)

date = soup.find('time').text
print(date)

news = soup.find('div', class_ = 'paragraph-content')
print(news)
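One small follow-up: calling .text (or print) on the paragraph-content div prints the raw tag markup; .get_text(strip=True) returns just the cleaned text. Here is a minimal, self-contained sketch of the parsing step against sample HTML — the HTML below is a hypothetical stand-in for what the real page serves inside the replay iframe, not the actual page source:

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the structure inside the archive's replay
# iframe (class names taken from the question, content made up).
html = """
<div class="block-region-content">
  <h1 class="au-header-heading">Advancing the national interest</h1>
  <time datetime="2003-01-15">15 January 2003</time>
  <div class="paragraph-content"><p>Call for public submissions.</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# .get_text(strip=True) returns clean text instead of the raw tag markup
title = soup.find("h1", class_="au-header-heading").get_text(strip=True)
date = soup.find("time").get_text(strip=True)
news = soup.find("div", class_="paragraph-content").get_text(strip=True)

print(title)
print(date)
print(news)
```

If you need to scrape anything outside the iframe afterwards, switch back with driver.switch_to.default_content().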