I'm trying to scrape all of the monetary policy report links from this ECB page using Python's Selenium package. Below is my code:
from typing import List

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
CHROME_PATH = <INSERT_CHROME_PATH_HERE>
url = "https://www.ecb.europa.eu/press/govcdec/mopo/html/index.en.html"
# XPath of the monetary policy report links (ids run snippet0 through snippet22)
xpath = """//*[@id='snippet0']/dd/div[2]/span/a |
//*[@id='snippet1']/dd/div[2]/span/a |
//*[@id='snippet2']/dd/div[2]/span/a |
//*[@id='snippet3']/dd/div[2]/span/a |
//*[@id='snippet4']/dd/div[2]/span/a |
//*[@id='snippet5']/dd/div[2]/span/a |
//*[@id='snippet6']/dd/div[2]/span/a |
//*[@id='snippet7']/dd/div[2]/span/a |
//*[@id='snippet8']/dd/div[2]/span/a |
//*[@id='snippet9']/dd/div[2]/span/a |
//*[@id='snippet10']/dd/div[2]/span/a |
//*[@id='snippet11']/dd/div[2]/span/a |
//*[@id='snippet12']/dd/div[2]/span/a |
//*[@id='snippet13']/dd/div[2]/span/a |
//*[@id='snippet14']/dd/div[2]/span/a |
//*[@id='snippet15']/dd/div[2]/span/a |
//*[@id='snippet16']/dd/div[2]/span/a |
//*[@id='snippet17']/dd/div[2]/span/a |
//*[@id='snippet18']/dd/div[2]/span/a |
//*[@id='snippet19']/dd/div[2]/span/a |
//*[@id='snippet20']/dd/div[2]/span/a |
//*[@id='snippet21']/dd/div[2]/span/a |
//*[@id='snippet22']/dd/div[2]/span/a
"""
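As an aside, since the 23 branches differ only in the snippet number, the union can be generated instead of typed out by hand (this assumes the ids really do run `snippet0` through `snippet22`); XPath 1.0's `starts-with()` would also collapse it into the single expression `//*[starts-with(@id, 'snippet')]/dd/div[2]/span/a`:

```python
# Build the same 23-branch union programmatically
# (assumes ids snippet0 through snippet22, as above).
branches = [f"//*[@id='snippet{i}']/dd/div[2]/span/a" for i in range(23)]
xpath = " | ".join(branches)
print(xpath.count("|"))  # 22 union operators joining 23 branches
```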
wait_until_selector = "#snippet22 > dd:nth-child(2) > div.ecb-langSelector > span > a" # css selector of last link on page
def get_tags_by_xpath_on_page(
    driver: webdriver.Chrome, wait_until_selector: str, xpath: str
) -> List[str]:
    driver.maximize_window()
    driver.get(url)
    driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
    )  # scroll to bottom
    TIMEOUT = 5
    try:
        element_present = EC.presence_of_element_located(
            (By.CSS_SELECTOR, wait_until_selector)
        )
        WebDriverWait(driver, TIMEOUT).until(element_present)
    except TimeoutException:
        print("Timed out waiting for page to load")
    elems = driver.find_elements_by_xpath(xpath)
    tags = [elem.get_attribute("href") for elem in elems]
    return tags

with webdriver.Chrome(CHROME_PATH) as driver:
    tags = get_tags_by_xpath_on_page(driver, wait_until_selector, xpath)
This currently only captures the links for the 1999 monetary policy reports at the very bottom of the page. How do I fix this code so that it scrapes every year?
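My suspicion is that the page loads more entries as you scroll, so a single jump to `document.body.scrollHeight` only triggers the last batch. A common pattern for that is to scroll in a loop until the page height stops growing; the helper below is an untested sketch of that idea, not something I have verified against the ECB page:

```python
import time

def scroll_to_load_all(driver, pause: float = 1.0, max_rounds: int = 30) -> None:
    """Keep scrolling until document.body.scrollHeight stops growing,
    a common pattern for pages that lazy-load content on scroll."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to append new entries
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded; assume we've reached the end
        last_height = new_height
```

Would something like this be the right approach here, or is there a better way to make Selenium wait for every snippet to load?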