2

I am learning web scraping with Python and have been using the Requests and BeautifulSoup4 libraries. This works well for ordinary HTML pages. But when I try to get data from sites where the data loads after a short delay, I get an empty value. An example:

from bs4 import BeautifulSoup
from operator import itemgetter
from selenium import webdriver
url = "https://www.example.com/;1"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
a = soup.find('span', 'buy')
print(a)

I am trying to grab the value from here: (value)

I have already looked at a similar question and tried to write my code along the lines of the solution given there, but it doesn't seem to work. I am a novice, so I need help getting this to work. The question I referred to: How to scrape html table only after data loads using Python Requests?

The table (content) is probably generated by JavaScript and thus can't be "seen" in the initial HTML. I am using Python 3.6 / PhantomJS / Selenium, as proposed by a lot of answers here.

12
  • You can use one of the waits described at selenium-python.readthedocs.io/waits.html, or just add time.sleep(n). Commented Oct 4, 2017 at 20:43
  • Can you please check the URL? It seems the ; there is a typo and might be causing the error in your scraper. Commented Oct 4, 2017 at 20:45
  • @AndMar time.sleep doesn't seem to work in this case. Where exactly do you suggest I add it? Commented Oct 4, 2017 at 20:57
  • @jabargas The same code works if I just change soup.find('span', 'buy') to soup.find('span', 'btc'), which is static content rather than dynamic content that gets loaded a few seconds after the page loads. So I doubt there is any issue with the URL. Commented Oct 4, 2017 at 20:59
  • Try browser.implicitly_wait(n), where n is the number of seconds to wait. Commented Oct 4, 2017 at 21:00
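
On the question of where the wait goes: it belongs after browser.get(url) and before the value is read, and it usually helps to re-read browser.page_source on every attempt rather than sleeping once. Below is a minimal sketch of that polling idea in plain Python. The names wait_for and read_buy_span are hypothetical helpers made up for this illustration, not part of Selenium or BeautifulSoup, and the sketch assumes the 'buy' span starts out missing or empty and gets filled in later.

```python
import time


def wait_for(get_value, timeout=10.0, poll_interval=0.5):
    """Call get_value() repeatedly until it returns something truthy.

    Raises TimeoutError if nothing truthy shows up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = get_value()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("value did not appear within %.1f s" % timeout)
        time.sleep(poll_interval)


def read_buy_span(browser):
    """Hypothetical helper: re-parse the live page and return the 'buy' span text.

    Returns None while the span is missing and '' while it is still empty,
    so wait_for() keeps polling in both cases.
    """
    from bs4 import BeautifulSoup  # imported lazily; needs beautifulsoup4 and lxml

    soup = BeautifulSoup(browser.page_source, "lxml")
    span = soup.find("span", "buy")
    return span.get_text(strip=True) if span else None
```

With the browser object from the question, this would be used as browser.get(url) followed by wait_for(lambda: read_buy_span(browser), timeout=10). Selenium's own explicit waits (WebDriverWait) follow the same poll-until-ready pattern and are probably the better choice in real code.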

2 Answers 2

3

To scrape content that loads after a delay, you have to run a real browser and wait for the element to appear. Selenium can do this. Here is sample code that uses Chrome as the driver and an explicit wait:

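from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://www.example.com/"  # placeholder: the page you want to scrape

# Assumes chromedriver is on PATH or that Selenium can locate it itself;
# older Selenium versions also accept webdriver.Chrome(<chromedriver path here>)
browser = webdriver.Chrome()
browser.set_window_size(1120, 550)
browser.get(link)
try:
    # Wait up to 3 seconds for the element to be present in the DOM
    element = WebDriverWait(browser, 3).until(
        EC.presence_of_element_located((By.ID, "blabla"))
    )
    data = element.get_attribute('data-blabla')
    print(data)
finally:
    browser.quit()  # close the browser even if the wait times out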

2 Comments

It works like a charm using the Chrome webdriver, but it actually opens a browser window. Is there something similar with a headless browser? Maybe you have similar code for PhantomJS or another driver that doesn't open a visible browser window but runs in the background? Thanks again. Once I get your response I will mark this post as answered.
Please replace webdriver.Chrome(<chromedriver path>) with webdriver.PhantomJS(<phantomjs driver path>). Everything else stays the same.
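
As an alternative to PhantomJS (which is no longer maintained, and which newer Selenium releases have dropped), Chrome itself can run without a window. Here is a minimal sketch, assuming a Selenium version that accepts the options= keyword and a locally installed Chrome. The helper names headless_chrome_args and make_headless_chrome are made up for this example.

```python
def headless_chrome_args(width=1120, height=550):
    """Command-line switches for a Chrome instance with no visible window."""
    return ["--headless", "--window-size=%d,%d" % (width, height)]


def make_headless_chrome():
    """Start Chrome without opening a window (needs selenium and Chrome installed)."""
    # Imported here so the helper above stays dependency-free
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in headless_chrome_args():
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

In the answer's code, browser = make_headless_chrome() would take the place of the webdriver.Chrome(...) line; the wait and the attribute lookup stay as they are.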
0

You can get the desired values by requesting them directly from the API and parsing the JSON response.

import requests

res = requests.get('https://api.example.com/api/')
res.raise_for_status()  # stop here on an HTTP error status
d = res.json()  # parse the JSON body into a dict

print(d['market'])
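
If the endpoint sometimes returns an error page or a payload without the expected key, d['market'] will raise. A slightly more defensive sketch is below. The function name extract_market and the sample body are invented for illustration; the real response shape of the API is not known here.

```python
import json


def extract_market(body_text):
    """Return the 'market' entry from a JSON response body, or None if absent."""
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        return None  # body was not JSON (for example an HTML error page)
    if not isinstance(payload, dict):
        return None  # valid JSON, but not an object with keys
    return payload.get("market")


# Made-up sample body, for illustration only
sample = '{"market": {"buy": "42.5", "sell": "43.0"}}'
print(extract_market(sample))
```

With requests, this would be called as extract_market(res.text).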

1 Comment

Thanks for the response. An API would do for this site, but the original idea was to understand how to scrape a value from a site where the data loads after a slight delay. That is the key ask of this post.
