
I have a number of different pages (let's say 50) to scrape. I am using Selenium WebDriver inside a Scrapy Downloader Middleware to download each page, and then process each page in the classical Scrapy manner.

There should be a way to make this faster: instead of waiting for the Selenium driver to download all 50 pages sequentially before processing them, I could perhaps introduce a multiprocessing pool or multiple Selenium drivers to download pages concurrently, since all processing happens only once every page has been downloaded.

However, I am not sure how to do this, since the middleware takes a single request as input to its process_request method:

def process_request(self, request, spider):
    ...
    # blocking call: the single driver fetches one URL at a time
    self.driver.get(request.url)
    ...
    return HtmlResponse(self.driver.current_url,
                        body=self.driver.page_source,
                        encoding='utf-8',
                        request=request)

In the part of the code that comes before the Middleware, I have something like this:

for item in items:
    request = Request(url=...)
    yield request

Each of these requests gets sent to the middleware sequentially, so I'm not sure whether anything can be done at this point to introduce concurrency.

What can be done to increase the speed of this task?

1 Answer


You could try using Docker Swarm to spin up a pool of Selenium instances, then have the downloader middleware use one of the available instances by passing the name of the instance as a request meta attribute.

Here is an example (although it does not integrate Scrapy): http://danielfrg.com/blog/2015/09/28/crawling-python-selenium-docker/
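For concreteness, here is a minimal sketch of what such a middleware could look like, assuming the Selenium nodes are already running (for example started via Docker Swarm) and reachable at known URLs. The SELENIUM_NODES setting, the selenium_node meta key and the class name are all made up for this example, and Chrome via Selenium's Remote WebDriver is just an assumption:

from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumPoolMiddleware:
    """Downloader middleware that spreads requests over several Selenium nodes."""

    def __init__(self, node_urls):
        self.node_urls = node_urls
        self.drivers = {}  # node URL -> Remote WebDriver, created lazily

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting listing the remote Selenium endpoints, e.g.
        # SELENIUM_NODES = ['http://selenium-1:4444/wd/hub',
        #                   'http://selenium-2:4444/wd/hub']
        return cls(crawler.settings.getlist('SELENIUM_NODES'))

    def _driver_for(self, node_url):
        # Create one Remote WebDriver per node and reuse it across requests
        if node_url not in self.drivers:
            self.drivers[node_url] = webdriver.Remote(
                command_executor=node_url,
                options=webdriver.ChromeOptions())
        return self.drivers[node_url]

    def process_request(self, request, spider):
        # The spider picks the instance by putting its URL in request.meta,
        # as suggested above; fall back to the first node otherwise
        node_url = request.meta.get('selenium_node', self.node_urls[0])
        driver = self._driver_for(node_url)
        driver.get(request.url)
        return HtmlResponse(driver.current_url,
                            body=driver.page_source,
                            encoding='utf-8',
                            request=request)

On the spider side, each Request would then carry the node it should use in its meta dict, for example assigned round-robin over the node list. Note that the blocking driver.get still runs inside process_request, so calls happen one at a time on the reactor thread; newer Scrapy versions can also wait on a Deferred returned from process_request, so wrapping the blocking call with twisted.internet.threads.deferToThread is one way to let several of these fetches actually overlap.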
