
I'm accessing a page that implements parallax scrolling. I use Selenium to scroll to the bottom, but BeautifulSoup is not picking up the updated DOM. The code is given below:

import requests
from bs4 import BeautifulSoup
from gensim.summarization import summarize

from selenium import webdriver
from datetime import datetime
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from time import sleep
import sys
import os
import xmltodict
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import traceback
import random

driver = None
driver = webdriver.Firefox()
driver.maximize_window()
def fetch_links(tag):
    links = []
    url = 'https://steemit.com/trending/'+tag
    driver.get(url)
    html = driver.page_source
    sleep(4)

    soup = BeautifulSoup(html,'lxml')
    entries = soup.select('.entry-title > a')
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(5)
    entries = soup.select('.entry-title > a')
    for e in entries:
        if e['href'].strip() not in entries:
            links.append(e['href'])
    return links

1 Answer


You probably need to parse the page once the window is scrolled:

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

sleep(5)

soup = BeautifulSoup(driver.page_source, 'lxml')
entries = soup.select('.entry-title > a')
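
As a minimal sketch, the whole fetch_links function with the parsing moved after the scroll could look like the following (it reuses the Firefox driver, the '.entry-title > a' selector, and the sleep durations from the question; nothing else is assumed):

from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.maximize_window()

def fetch_links(tag):
    links = []
    driver.get('https://steemit.com/trending/' + tag)
    sleep(4)  # let the initial batch of entries render

    # Trigger the lazy loading by scrolling to the bottom, then wait for new entries
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(5)

    # Take page_source only AFTER scrolling so the newly added entries are included
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for e in soup.select('.entry-title > a'):
        href = e['href'].strip()
        if href not in links:  # compare against links, not entries
            links.append(href)
    return links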

Comments

It seems that the issue is with BeautifulSoup; all the titles are present in the HTML returned by driver.page_source.
By default the page shows 20 records; on scroll it should load the next 20.
As an alternative you could extract all the links directly with a single JavaScript call: links = driver.execute_script("return [].map.call(document.querySelectorAll('.entry-title > a'), e => e.href)")
How is it going to pick up links that are not yet part of the DOM?
From the tests I've executed, the new links are present in the DOM; see the sketch below for scrolling in a loop until no more entries load.
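
If the page really does load entries 20 at a time, a single scroll may not be enough. A hedged sketch, assuming the same driver as above, keeps scrolling until the document height stops growing and then collects the hrefs with the JavaScript call from the comment (the 2-second pause is an arbitrary choice):

from time import sleep

def fetch_all_links(driver):
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll to the bottom and give the next batch time to load
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(2)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was added, so we have reached the end
        last_height = new_height
    # Collect every href in one JavaScript call, without BeautifulSoup
    return driver.execute_script(
        "return [].map.call(document.querySelectorAll('.entry-title > a'), e => e.href)")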