I'm trying to scrape all monetary policy reports on this ECB website here using python's Selenium package. Below is my code:

from typing import List

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

CHROME_PATH = "<INSERT_CHROME_PATH_HERE>"  # path to your chromedriver executable

url = "https://www.ecb.europa.eu/press/govcdec/mopo/html/index.en.html"

xpath = """//*[@id='snippet*']/dd/div[2]/span/a | # xpath of monetary policy report links
//*[@id='snippet1']/dd/div[2]/span/a |
//*[@id='snippet2']/dd/div[2]/span/a |
//*[@id='snippet3']/dd/div[2]/span/a |
//*[@id='snippet4']/dd/div[2]/span/a |
//*[@id='snippet5']/dd/div[2]/span/a |
//*[@id='snippet6']/dd/div[2]/span/a |
//*[@id='snippet7']/dd/div[2]/span/a |
//*[@id='snippet8']/dd/div[2]/span/a |
//*[@id='snippet9']/dd/div[2]/span/a |
//*[@id='snippet10']/dd/div[2]/span/a |
//*[@id='snippet11']/dd/div[2]/span/a |
//*[@id='snippet12']/dd/div[2]/span/a |
//*[@id='snippet13']/dd/div[2]/span/a |
//*[@id='snippet14']/dd/div[2]/span/a |
//*[@id='snippet15']/dd/div[2]/span/a |
//*[@id='snippet16']/dd/div[2]/span/a |
//*[@id='snippet17']/dd/div[2]/span/a |
//*[@id='snippet18']/dd/div[2]/span/a |
//*[@id='snippet19']/dd/div[2]/span/a |
//*[@id='snippet20']/dd/div[2]/span/a |
//*[@id='snippet21']/dd/div[2]/span/a |
//*[@id='snippet22']/dd/div[2]/span/a 
"""

wait_until_selector = "#snippet22 > dd:nth-child(2) > div.ecb-langSelector > span > a" # css selector of last link on page
def get_tags_by_xpath_on_page(
    driver: webdriver.Chrome, wait_until_selector: str, xpath: str
) -> List[str]:

    driver.maximize_window()
    driver.get(url)
    driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
    )  # scroll to bottom
    TIMEOUT = 5
    try:
        element_present = EC.presence_of_element_located(
            (By.CSS_SELECTOR, wait_until_selector)
        )
        WebDriverWait(driver, TIMEOUT).until(element_present)
    except TimeoutException:
        print("Timed out waiting for page to load")
    elems = driver.find_elements_by_xpath(xpath)
    tags = [elem.get_attribute("href") for elem in elems]
    return tags

with webdriver.Chrome(CHROME_PATH) as driver:
    tags = get_tags_by_xpath_on_page(driver, wait_until_selector, xpath)

This currently only captures the links for the 1999 monetary policy reports at the very bottom of the page. How do I fix this code so that it scrapes everything?

1 Answer

I've gone through the JavaScript, the HTML, and the requests made after the initial page load, and what you're probably after are the links that look like:

https://www.ecb.europa.eu/press/govcdec/mopo/2019/html/index_include.en.html
https://www.ecb.europa.eu/press/govcdec/mopo/2018/html/index_include.en.html
https://www.ecb.europa.eu/press/govcdec/mopo/2017/html/index_include.en.html

...

https://www.ecb.europa.eu/press/govcdec/mopo/2012/html/index_include.en.html

2020 and 2021 also return results.

If you watch the requests made after the initial page load (in Chrome DevTools, under the "Network" tab) while you scroll down, you'll see that the URLs being called follow a fairly obvious pattern.
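As a rough sketch of that approach: you could request each year's include page directly and pull the hrefs out with requests and BeautifulSoup. The 1999-2021 year range and the assumption that the include pages use the same div.ecb-langSelector markup as the main page are mine for illustration, so check them against the real responses.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.ecb.europa.eu"
YEARS = range(1999, 2022)  # assumed range; extend as new years appear

links = []
for year in YEARS:
    include_url = f"{BASE}/press/govcdec/mopo/{year}/html/index_include.en.html"
    resp = requests.get(include_url, timeout=10)
    if resp.status_code != 200:
        continue  # a year with no include page is simply skipped
    soup = BeautifulSoup(resp.text, "html.parser")
    # Mirror the question's XPath: the language-selector span holds the link.
    # Fall back to every anchor if the include pages are laid out differently.
    anchors = soup.select("div.ecb-langSelector span a[href]") or soup.find_all("a", href=True)
    links.extend(urljoin(BASE, a["href"]) for a in anchors)

print(f"collected {len(links)} links")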

You could also start from the GET request for https://www.ecb.europa.eu/shared/nav/navigation.min.en.json?v=1626262372 and work your way up the call stack to confirm that the requests you want are the ones above (I wouldn't advise this for beginners).

There's also another JavaScript call that comes back with a JSON response that may be useful. Just search through the requests under the "Network" tab and select the "Preview" sub-tab on any of the items loaded by the initial request. It looks like a lot, but if you work through the responses one by one it is manageable.
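If you'd rather stick with your original Selenium code, the likely problem is that a single jump to the bottom of the page doesn't trigger (or wait for) the lazy-loaded year sections. Here's a rough sketch of scrolling in steps instead, with your 23 per-snippet XPaths generalised via starts-with; the step size and the one-second pause are arbitrary choices, not values taken from the site.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://www.ecb.europa.eu/press/govcdec/mopo/html/index.en.html"

with webdriver.Chrome() as driver:  # assumes chromedriver is on your PATH
    driver.get(URL)
    last_height = 0
    while True:
        driver.execute_script("window.scrollBy(0, 1500);")  # scroll one step
        time.sleep(1)  # crude pause; an explicit wait on the next section would be nicer
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset >= document.body.scrollHeight;"
        )
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if at_bottom and new_height == last_height:
            break  # reached the bottom and nothing new was loaded
        last_height = new_height
    # One XPath instead of 23: match every element whose id starts with 'snippet'.
    elems = driver.find_elements(By.XPATH, "//*[starts-with(@id, 'snippet')]/dd/div[2]/span/a")
    tags = [e.get_attribute("href") for e in elems]
    print(f"collected {len(tags)} links")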

3 Comments

Thanks. That’s exactly right re the links I’m looking for. Unfortunately, I’m a beginner here :/
I've done a fair bit of scraping of dynamic websites by now, and a more direct approach removes the need for tools like Selenium, which are fairly heavyweight for such a simple task. Best avoided if you can, since they're less reliable for data scraping than making plain HTTP requests.
Would you be able to post code / pseudocode for this specific task? I’m not well versed in JS. Thank you!
