
I am making a web crawler/scraper using Python and Scrapy. Because some websites load their content dynamically, I'm also using Selenium in combination with PhantomJS. When I started using this I thought the performance would be acceptable, but it turns out to be quite slow. Now I'm not sure whether that is because of some flaw in my code, or because the frameworks/programs I'm using are not optimised enough. So I'm asking for suggestions on what I could do to improve the performance.
The code I wrote takes approx. 35 seconds from start to finish. It executes about 11 GET requests and 3 POST requests.

import scrapy
from scrapy.http.request import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
import time


class TechcrunchSpider(scrapy.Spider):
    name = "techcrunch_spider_performance"
    allowed_domains = ['techcrunch.com']
    start_urls = ['https://techcrunch.com/search/heartbleed']



    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)
        #self.driver = webdriver.Chrome("C:\Users\Daniel\Desktop\Sonstiges\chromedriver.exe")
        self.driver.wait = WebDriverWait(self.driver, 5)    #waits up to 5 seconds

    def parse(self, response):
        start = time.time()     #time measurement
        self.driver.get(response.url)

        #waits up to 5 seconds (defined above) for the condition to be met; after that a TimeoutException is raised
        try:

            self.driver.wait.until(EC.presence_of_element_located(
                (By.CLASS_NAME, "block-content")))
            print("Found : block-content")

        except TimeoutException:
            self.driver.close()
            print(" block-content NOT FOUND IN TECHCRUNCH !!!")


        #Crawl the JavaScript-generated content with Selenium

        ahref = self.driver.find_elements(By.XPATH, '//h2[@class="post-title st-result-title"]/a')

        hreflist = []
        #Collect the links to the individual articles
        for elem in ahref:
            hreflist.append(elem.get_attribute("href"))


        for elem in hreflist:
            print(elem)



        print("im closing myself")
        self.driver.close()
        end = time.time()
        print("Time elapsed : ")
        finaltime = end-start
        print(finaltime)

I am using Windows 8 64-bit, an Intel i7-3630QM CPU @ 2.4 GHz, an Nvidia GeForce GT 650M, and 8 GB of RAM.

  • You could try generating the AJAX requests through your spider, eliminating the need for Selenium and the 5-second wait for the page to load (a sketch of this approach follows below). Check this frequent post.
  • Read the answer to this question: stackoverflow.com/questions/39036137/…
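
As a rough sketch of the first comment's suggestion, the search data could be fetched directly with Scrapy, with no browser and no fixed wait. The JSON endpoint below is hypothetical; you would have to find the real XHR request in the browser's network tab and copy its URL and parameters:

import json

import scrapy


class TechcrunchAjaxSpider(scrapy.Spider):
    name = "techcrunch_ajax_sketch"
    allowed_domains = ['techcrunch.com']

    def start_requests(self):
        # Hypothetical JSON endpoint; inspect the page's XHR traffic to find
        # the real URL and query parameters.
        yield scrapy.Request(
            'https://techcrunch.com/example-search-api/?q=heartbleed',
            callback=self.parse_search,
        )

    def parse_search(self, response):
        # Assuming the endpoint returns a JSON list of posts with a "link" field.
        for post in json.loads(response.text):
            yield {'url': post.get('link')}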

2 Answers


I was also facing this same issue, getting only 2 URLs processed per minute.

I cached the web page by doing this:

......
options = ['--disk-cache=true']
self.driver = webdriver.PhantomJS(service_args=options)
......

This boosted the URL processing from 2 to 11 per minute in my case. This may vary from web page to web page.

In case you want to disable image loading to speed up page loading in Selenium, add --load-images=false to the options above.
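
For reference, a minimal sketch combining both switches (both are standard PhantomJS command-line options; the driver call is the same Selenium API used in the question):

from selenium import webdriver

# Enable PhantomJS's disk cache and skip image downloads to speed up page loads.
service_args = ['--disk-cache=true', '--load-images=false']
driver = webdriver.PhantomJS(service_args=service_args)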

Hope it helps.



Try using Splash to process pages with JavaScript instead.
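
For example, a minimal sketch using the scrapy-splash plugin, assuming a Splash instance is running locally (e.g. via Docker on port 8050) and SPLASH_URL plus the scrapy-splash middlewares are configured in settings.py:

import scrapy
from scrapy_splash import SplashRequest


class TechcrunchSplashSpider(scrapy.Spider):
    name = "techcrunch_splash_sketch"
    allowed_domains = ['techcrunch.com']

    def start_requests(self):
        # Splash renders the JavaScript before the response reaches the spider.
        yield SplashRequest(
            'https://techcrunch.com/search/heartbleed',
            callback=self.parse,
            args={'wait': 2},  # give the search results time to render
        )

    def parse(self, response):
        # The response body is the rendered HTML, so plain XPath works.
        for href in response.xpath(
                '//h2[@class="post-title st-result-title"]/a/@href').extract():
            yield {'url': href}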

