Selenium scrolling internal scroll bar and scraping results

Question

I'm trying to scrape this website for my project to populate a list of insurance products available.

However, the website has an internal scroll bar. Only the first 10 items are displayed on the page, and new elements appear only when you scroll that internal bar downwards.

How do I:

1. Use Python Selenium to scroll that internal bar downwards? I can't seem to find much information about that.
2. Use Selenium to retrieve the Company Name, Product Name, Paymode and product features (if active), and return a pandas DataFrame?

alecxe · Accepted Answer · 2016-09-13 14:28:06Z

Interestingly, you don't need to scroll the container at all. All the results are already loaded into the page; some of them are just not visible. You can simply find all li elements with the result_content class and read the desired data from them.

Working example that extracts the product names:

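from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"))
driver.maximize_window()
driver.get("http://comparefirst.sg/wap/productsListEvent.action?prodGroup=whole&pageAction=prodlisting")

# wait until the results container is visible
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "result_container")))

# every result is already in the DOM, including the ones scrolled out of view
# (the find_element(s)_by_* helpers were removed in Selenium 4.3)
results = driver.find_elements(By.CSS_SELECTOR, "li.result_content")

for result in results:
    # innerText also returns the text of elements that are not displayed
    prod_name = result.find_element(By.ID, "sProdName").get_attribute("innerText")
    print(prod_name)

# quit() closes the window and also ends the chromedriver process
driver.quit()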

Prints:

AIA Gen3 (II)
AIA Guaranteed Protect Plus
AIA Guaranteed Protect Plus
...
DIRECT- TM Basic Whole Life
DIRECT- TM Basic Whole Life (+ Critical Illness)
TM Legacy
TM Legacy (+ Critical Illness)
TM Legacy LifeFlex
TM Legacy LifeFlex (+ Critical Illness)
TM Retirement GIO
TM Retirement PaycheckLife (Single Life)

Note that we have to use .get_attribute("innerText") instead of .text, since .text returns only the visible text and most of our elements are not visible.
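
To make that note concrete, here is a minimal sketch of a small helper. The name read_text is hypothetical and not part of Selenium, and the fallback to textContent is an assumption for elements where innerText comes back empty. It works on any object that exposes get_attribute, such as a WebElement.

```python
def read_text(element):
    """Return an element's text even when the element is not displayed.

    `element` is expected to behave like a Selenium WebElement, i.e. to
    expose get_attribute(name). read_text is a hypothetical helper name.
    """
    # innerText and textContent are DOM properties that get_attribute can
    # read; unlike .text they do not depend on the element being visible.
    for name in ("innerText", "textContent"):
        value = element.get_attribute(name)
        if value:
            return value.strip()
    # Neither property held any text
    return ""
```

In the loop above it could be used as read_text(result.find_element(By.ID, "sProdName")).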

3 Comments

jake wong Over a year ago

Thanks for the quick response! This seems to work wonderfully. But the name of the company looks to be within <h3>COMPANY NAME</h3> tags. Any idea how I can retrieve this as well? Also, if a product feature picture is active, how do I pick out that information?

alecxe Over a year ago

@jakewong you should be able to locate other fields inside every result by calling the find_element methods on the result itself. E.g. to get the h3 element: result.find_element(By.TAG_NAME, "h3").get_attribute("innerText") (Selenium releases before 4.3 also accept result.find_element_by_tag_name("h3")).

jake wong Over a year ago

Oh, I didn't know you can do that. Thanks. I'll check it out and play around with it. :)
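
Putting the comment thread together, the following is a rough sketch of how the per-result lookups could feed a pandas DataFrame, which is what the question asked for. The function names extract_rows and to_dataframe and the column names "company" and "product" are assumptions made for illustration. Only the h3 and sProdName lookups come from the answer; the locators for Paymode and the product features are not given in this thread, so they are left out.

```python
# These strings equal Selenium's By.TAG_NAME and By.ID constants; they are
# spelled out so the sketch has no import-time dependency on selenium.
TAG_NAME = "tag name"
ID = "id"


def extract_rows(results):
    """Turn each li.result_content element into a dict of text fields.

    extract_rows and the keys "company" / "product" are hypothetical names.
    """
    rows = []
    for result in results:
        # Search inside the current result only, not the whole page
        company = result.find_element(TAG_NAME, "h3").get_attribute("innerText")
        product = result.find_element(ID, "sProdName").get_attribute("innerText")
        rows.append({
            "company": (company or "").strip(),
            "product": (product or "").strip(),
        })
    return rows


def to_dataframe(rows):
    """Build a DataFrame with one row per result."""
    import pandas as pd  # imported lazily so extract_rows works without pandas
    return pd.DataFrame(rows, columns=["company", "product"])
```

With the driver from the answer still open, to_dataframe(extract_rows(results)) would give one row per result.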
