I am fairly new to web scraping and am trying to build a scraper that collects the title, text, and date from this archived page:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.common.by import By
import pandas as pd
import csv
import sqlite3
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
service = Service()
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)
url = 'https://webarchive.nla.gov.au/awa/20220427161017/https://www.dfat.gov.au/news/media/Pages/advancing-the-national-interest-call-for-public-submissions'
driver.get(url)
p = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "block-region-content")))
time.sleep(120)
soup = BeautifulSoup(driver.page_source, 'html.parser')
title = soup.find('h1', class_='au-header-heading').text
print(title)
date = soup.find('time').text
print(date)
news = soup.find('div', class_='paragraph-content')
print(news)
I keep getting the following error message:
--> 105 raise TimeoutException(message, screen, stacktrace)
TimeoutException: Message:
I'm not sure how to interpret this, since I've built in an explicit wait and a sleep timer to make sure the elements I want to scrape have actually loaded (or so I thought). For comparison, here is a link to the non-archived site: https://www.dfat.gov.au/news/media-release/memorandum-understanding-between-government-commonwealth-australia-and-government-state-california-united-states-america.
The code above works perfectly on this link (even without the built-in wait times), so the problem seems to be specific to the archived site. Any suggestions would be very helpful.
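
As far as I can tell, the TimeoutException is raised by the WebDriverWait call itself, so the sleep on the next line is never reached. One thing I have not ruled out (this is an assumption on my part, not something I have confirmed for this page) is that the archive viewer renders the original page inside an iframe, in which case the class would never be visible from the top-level document. Below is a rough diagnostic sketch for checking that. The names `find_in_frames` and `demo` are hypothetical helpers I made up for illustration, and the sketch only looks one level deep, not into nested frames.

```python
def find_in_frames(driver, by, value):
    """Search the top-level document first, then each top-level iframe.

    Returns (frame_index, elements). frame_index is None when the match is in
    the top-level document; an empty list means nothing was found anywhere.
    On a match inside a frame, the driver is left switched into that frame.
    """
    driver.switch_to.default_content()
    found = driver.find_elements(by, value)
    if found:
        return None, found

    # "tag name" is the string value behind selenium's By.TAG_NAME.
    frames = driver.find_elements("tag name", "iframe")
    for index in range(len(frames)):
        driver.switch_to.default_content()
        driver.switch_to.frame(index)  # switch by position in the page
        found = driver.find_elements(by, value)
        if found:
            return index, found

    driver.switch_to.default_content()
    return None, []


def demo(url):
    """Open url in Chrome and report where the content block lives, if anywhere."""
    # Imported here so the helper above can be tried without a browser installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    def located(d):
        # WebDriverWait keeps polling until this returns something truthy.
        result = find_in_frames(d, By.CLASS_NAME, "block-region-content")
        return result if result[1] else False

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        try:
            frame_index, elements = WebDriverWait(driver, 20).until(located)
        except TimeoutException:
            print("Not found in the top-level document or any top-level iframe")
        else:
            where = "top-level document" if frame_index is None else f"iframe #{frame_index}"
            print(f"Found {len(elements)} matching element(s) in the {where}")
    finally:
        driver.quit()
```

Calling `demo(url)` with the archived URL from my code should print which case applies. If the element does turn up inside an iframe, that would at least explain why the same wait succeeds on the live site but times out on the archived copy.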