I'm trying to scrape all monetary policy reports on this ECB website here using python's Selenium package. Below is my code:

from typing import List

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

CHROME_PATH = "<INSERT_CHROME_PATH_HERE>"  # path to your chromedriver executable

url = "https://www.ecb.europa.eu/press/govcdec/mopo/html/index.en.html"

xpath = """//*[@id='snippet*']/dd/div[2]/span/a | # xpath of monetary policy report links
//*[@id='snippet1']/dd/div[2]/span/a |
//*[@id='snippet2']/dd/div[2]/span/a |
//*[@id='snippet3']/dd/div[2]/span/a |
//*[@id='snippet4']/dd/div[2]/span/a |
//*[@id='snippet5']/dd/div[2]/span/a |
//*[@id='snippet6']/dd/div[2]/span/a |
//*[@id='snippet7']/dd/div[2]/span/a |
//*[@id='snippet8']/dd/div[2]/span/a |
//*[@id='snippet9']/dd/div[2]/span/a |
//*[@id='snippet10']/dd/div[2]/span/a |
//*[@id='snippet11']/dd/div[2]/span/a |
//*[@id='snippet12']/dd/div[2]/span/a |
//*[@id='snippet13']/dd/div[2]/span/a |
//*[@id='snippet14']/dd/div[2]/span/a |
//*[@id='snippet15']/dd/div[2]/span/a |
//*[@id='snippet16']/dd/div[2]/span/a |
//*[@id='snippet17']/dd/div[2]/span/a |
//*[@id='snippet18']/dd/div[2]/span/a |
//*[@id='snippet19']/dd/div[2]/span/a |
//*[@id='snippet20']/dd/div[2]/span/a |
//*[@id='snippet21']/dd/div[2]/span/a |
//*[@id='snippet22']/dd/div[2]/span/a 
"""

wait_until_selector = "#snippet22 > dd:nth-child(2) > div.ecb-langSelector > span > a" # css selector of last link on page
def get_tags_by_xpath_on_page(
    driver: webdriver.Chrome, wait_until_selector: str, xpath: str
) -> List[str]:

    driver.maximize_window()
    driver.get(url)
    driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
    )  # scroll to bottom
    TIMEOUT = 5
    try:
        element_present = EC.presence_of_element_located(
            (By.CSS_SELECTOR, wait_until_selector)
        )
        WebDriverWait(driver, TIMEOUT).until(element_present)
    except TimeoutException:
        print("Timed out waiting for page to load")
    elems = driver.find_elements_by_xpath(xpath)
    tags = [elem.get_attribute("href") for elem in elems]
    return tags

with webdriver.Chrome(CHROME_PATH) as driver:
    tags = get_tags_by_xpath_on_page(driver, wait_until_selector, xpath)

This currently only captures the links for the 1999 monetary policy reports at the very bottom of the page. How do I fix this code so that it scrapes everything?

1 Answer

I've gone through the JavaScript, the HTML, and the requests made after the initial page load, and what you're probably after are the links that look like:

https://www.ecb.europa.eu/press/govcdec/mopo/2019/html/index_include.en.html
https://www.ecb.europa.eu/press/govcdec/mopo/2018/html/index_include.en.html
https://www.ecb.europa.eu/press/govcdec/mopo/2017/html/index_include.en.html

...

https://www.ecb.europa.eu/press/govcdec/mopo/2012/html/index_include.en.html

2020 and 2021 also return results.

If you watch the requests made after the initial page load (in Chrome DevTools, under the "Network" tab) while you scroll down, you'll see that the URLs being called follow a fairly obvious pattern.
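As a rough sketch of that approach: you could request each year's include page directly and pull the hrefs out with requests and BeautifulSoup. The 1999-2021 year range and the assumption that the include pages use the same div.ecb-langSelector markup as the main page are mine for illustration, so check them against the real responses.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.ecb.europa.eu"
YEARS = range(1999, 2022)  # assumed range; extend as new years appear

links = []
for year in YEARS:
    include_url = f"{BASE}/press/govcdec/mopo/{year}/html/index_include.en.html"
    resp = requests.get(include_url, timeout=10)
    if resp.status_code != 200:
        continue  # a year with no include page is simply skipped
    soup = BeautifulSoup(resp.text, "html.parser")
    # Mirror the question's XPath: the language-selector span holds the link.
    # Fall back to every anchor if the include pages are laid out differently.
    anchors = soup.select("div.ecb-langSelector span a[href]") or soup.find_all("a", href=True)
    links.extend(urljoin(BASE, a["href"]) for a in anchors)

print(f"collected {len(links)} links")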

You could also start from the GET request for https://www.ecb.europa.eu/shared/nav/navigation.min.en.json?v=1626262372 and work your way up the call stack to confirm that the requests you want are the ones above (I wouldn't advise this for beginners).

There's also another JavaScript call that comes back with a JSON response that may be useful. Just search through the requests under the "Network" tab and select the "Preview" sub-tab on any of the items loaded by the initial request. It looks like a lot, but if you work through the responses one by one it is manageable.
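If you'd rather stick with your original Selenium code, the likely problem is that a single jump to the bottom of the page doesn't trigger (or wait for) the lazy-loaded year sections. Here's a rough sketch of scrolling in steps instead, with your 23 per-snippet XPaths generalised via starts-with; the step size and the one-second pause are arbitrary choices, not values taken from the site.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://www.ecb.europa.eu/press/govcdec/mopo/html/index.en.html"

with webdriver.Chrome() as driver:  # assumes chromedriver is on your PATH
    driver.get(URL)
    last_height = 0
    while True:
        driver.execute_script("window.scrollBy(0, 1500);")  # scroll one step
        time.sleep(1)  # crude pause; an explicit wait on the next section would be nicer
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset >= document.body.scrollHeight;"
        )
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if at_bottom and new_height == last_height:
            break  # reached the bottom and nothing new was loaded
        last_height = new_height
    # One XPath instead of 23: match every element whose id starts with 'snippet'.
    elems = driver.find_elements(By.XPATH, "//*[starts-with(@id, 'snippet')]/dd/div[2]/span/a")
    tags = [e.get_attribute("href") for e in elems]
    print(f"collected {len(tags)} links")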

3 Comments

Thanks. That’s exactly right re the links I’m looking for. Unfortunately, I’m a beginner here :/
I've done a fair bit of scraping of dynamic websites by now, and a more direct approach removes the need for tools like Selenium, which are fairly heavyweight for such a simple task. Best avoided if you can, since they're less reliable for data scraping than making plain HTTP requests.
Would you be able to post code / pseudocode for this specific task? I’m not well versed in JS. Thank you!
