I am a bit new to webscraping and trying to build a scraper to collect the title, text, and date from this archived page:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.common.by import By
import pandas as pd
import csv
import sqlite3
import time
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

service = Service()
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)

url = 'https://webarchive.nla.gov.au/awa/20220427161017/https://www.dfat.gov.au/news/media/Pages/advancing-the-national-interest-call-for-public-submissions'
driver.get(url)


p = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "block-region-content")))
time.sleep(120)
soup = BeautifulSoup(driver.page_source, 'html.parser')

title = soup.find('h1', class_ = 'au-header-heading').text
print(title)

date = soup.find('time').text
print(date)

news = soup.find('div', class_ = 'paragraph-content')
print(news)

I keep getting the following error message: --> 105 raise TimeoutException(message, screen, stacktrace)

TimeoutException: Message:

I'm not sure how to interpret this as I've built in some waits and sleep timers to ensure that elements that I want to scrape on the page have actually loaded (or so I thought). I'm also including a link to the non-archived site: https://www.dfat.gov.au/news/media-release/memorandum-understanding-between-government-commonwealth-australia-and-government-state-california-united-states-america.

The code above works on this link perfectly (without the built in wait times), so there seems to be an issue with the fact that I'm trying to scrape from the archived site. Any suggestions would be very helpful.

  • When you debug, which operation is timing out and what debugging have you done for that operation? Commented Sep 30 at 13:51
  • It seems to be raising this line: 23 p = WebDriverWait(driver, 40).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "block-region-content"))) Commented Sep 30 at 13:54
  • I changed EC.visibility_of_all_elements_located to EC.visibility_of_element_located and it still raises a timeout error with a series of stack trace lines. Commented Sep 30 at 13:59
  • A timeout on that line seems to imply that the element(s) you're looking for aren't found in the markup. In your debugging, can you examine the markup being used by the code? Is any such matching element there? Commented Sep 30 at 14:03
  • The archived page loads inside an iframe, which needs to be waited for Commented Sep 30 at 14:08

1 Answer


The issue is that the target element is inside an IFRAME. In such cases, the driver needs to switch into the IFRAME before it can interact with the element.

Here is the working refactored code:

  1. Removed unused imports (including time, which is no longer needed)
  2. Removed the unnecessary time.sleep
  3. Used WebDriverWait effectively by creating a single wait object and calling it twice
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

service = Service()
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)

url = 'https://webarchive.nla.gov.au/awa/20220427161017/https://www.dfat.gov.au/news/media/Pages/advancing-the-national-interest-call-for-public-submissions'
driver.get(url)
driver.maximize_window()
wait = WebDriverWait(driver, 20)
# Below line is to switch to iframe if the content is within an iframe
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "replayFrame")))

p = wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "block-region-content")))
soup = BeautifulSoup(driver.page_source, 'html.parser')

title = soup.find('h1', class_ = 'au-header-heading').text
print(title)

date = soup.find('time').text
print(date)

news = soup.find('div', class_ = 'paragraph-content')
print(news)
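One small follow-up: calling .text (or print) on the paragraph-content div prints the raw tag markup; .get_text(strip=True) returns just the cleaned text. Here is a minimal, self-contained sketch of the parsing step against sample HTML — the HTML below is a hypothetical stand-in for what the real page serves inside the replay iframe, not the actual page source:

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the structure inside the archive's replay
# iframe (class names taken from the question, content made up).
html = """
<div class="block-region-content">
  <h1 class="au-header-heading">Advancing the national interest</h1>
  <time datetime="2003-01-15">15 January 2003</time>
  <div class="paragraph-content"><p>Call for public submissions.</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# .get_text(strip=True) returns clean text instead of the raw tag markup
title = soup.find("h1", class_="au-header-heading").get_text(strip=True)
date = soup.find("time").get_text(strip=True)
news = soup.find("div", class_="paragraph-content").get_text(strip=True)

print(title)
print(date)
print(news)
```

If you need to scrape anything outside the iframe afterwards, switch back with driver.switch_to.default_content().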