
I'm accessing a page that implements parallax scrolling. I use Selenium to scroll to the bottom, but BeautifulSoup is not picking up the updated DOM. The code is given below:

import requests
from bs4 import BeautifulSoup
from gensim.summarization import summarize

from selenium import webdriver
from datetime import datetime
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from time import sleep
import sys
import os
import xmltodict
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import traceback
import random

driver = None
driver = webdriver.Firefox()
driver.maximize_window()
def fetch_links(tag):
    links = []
    url = 'https://steemit.com/trending/'+tag
    driver.get(url)
    html = driver.page_source
    sleep(4)

    soup = BeautifulSoup(html,'lxml')
    entries = soup.select('.entry-title > a')
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(5)
    entries = soup.select('.entry-title > a')
    for e in entries:
        if e['href'].strip() not in entries:
            links.append(e['href'])
    return links

1 Answer


You probably need to parse the page once the window is scrolled:

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

sleep(5)

soup = BeautifulSoup(driver.page_source, 'lxml')
entries = soup.select('.entry-title > a')
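
As a minimal sketch, the whole fetch_links function with the parsing moved after the scroll could look like the following (it reuses the Firefox driver, the '.entry-title > a' selector, and the sleep durations from the question; nothing else is assumed):

from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.maximize_window()

def fetch_links(tag):
    links = []
    driver.get('https://steemit.com/trending/' + tag)
    sleep(4)  # let the initial batch of entries render

    # Trigger the lazy loading by scrolling to the bottom, then wait for new entries
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(5)

    # Take page_source only AFTER scrolling so the newly added entries are included
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for e in soup.select('.entry-title > a'):
        href = e['href'].strip()
        if href not in links:  # compare against links, not entries
            links.append(href)
    return links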

Comments

It seems that the issue is with BeautifulSoup; all the titles are present in the HTML returned by driver.page_source.
By default the page shows 20 records; on scroll it should load the next 20.
As an alternative you could extract all the links directly with a single JavaScript call: links = driver.execute_script("return [].map.call(document.querySelectorAll('.entry-title > a'), e => e.href)")
How is it going to pick up links that are not yet part of the DOM?
From the tests I've executed, the new links are present in the DOM; see the sketch below for scrolling in a loop until no more entries load.
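
If the page really does load entries 20 at a time, a single scroll may not be enough. A hedged sketch, assuming the same driver as above, keeps scrolling until the document height stops growing and then collects the hrefs with the JavaScript call from the comment (the 2-second pause is an arbitrary choice):

from time import sleep

def fetch_all_links(driver):
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll to the bottom and give the next batch time to load
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(2)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was added, so we have reached the end
        last_height = new_height
    # Collect every href in one JavaScript call, without BeautifulSoup
    return driver.execute_script(
        "return [].map.call(document.querySelectorAll('.entry-title > a'), e => e.href)")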