
I have a number of different pages (let's say 50) to scrape. I am using Selenium WebDriver inside a Scrapy Downloader Middleware to download each page, and then process each page in the classical Scrapy manner.

There should be a way to make this faster: instead of waiting for the Selenium driver to download all 50 pages sequentially before processing them, I could perhaps introduce a multiprocessing pool or multiple Selenium drivers to download pages concurrently, since all processing happens only once every page has been downloaded.

However, I am not sure how to do this, since the middleware takes a single request as input to its process_request method:

def process_request(self, request, spider):
    ...
    # blocking call: the single driver fetches one URL at a time
    self.driver.get(request.url)
    ...
    return HtmlResponse(self.driver.current_url,
                        body=self.driver.page_source,
                        encoding='utf-8',
                        request=request)

In the part of the code that comes before the Middleware, I have something like this:

for item in items:
    request = Request(url=...)
    yield request

Each of these requests gets sent to the middleware sequentially, so I'm not sure whether anything can be done at this point to introduce concurrency.

What can be done to increase the speed of this task?

1 Answer


You could try using Docker Swarm to spin up a pool of Selenium instances, then have the downloader middleware use one of the available instances by passing the name of the instance as a request meta attribute.

Here is an example (although it does not integrate Scrapy): http://danielfrg.com/blog/2015/09/28/crawling-python-selenium-docker/
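For concreteness, here is a minimal sketch of what such a middleware could look like, assuming the Selenium nodes are already running (for example started via Docker Swarm) and reachable at known URLs. The SELENIUM_NODES setting, the selenium_node meta key and the class name are all made up for this example, and Chrome via Selenium's Remote WebDriver is just an assumption:

from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumPoolMiddleware:
    """Downloader middleware that spreads requests over several Selenium nodes."""

    def __init__(self, node_urls):
        self.node_urls = node_urls
        self.drivers = {}  # node URL -> Remote WebDriver, created lazily

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting listing the remote Selenium endpoints, e.g.
        # SELENIUM_NODES = ['http://selenium-1:4444/wd/hub',
        #                   'http://selenium-2:4444/wd/hub']
        return cls(crawler.settings.getlist('SELENIUM_NODES'))

    def _driver_for(self, node_url):
        # Create one Remote WebDriver per node and reuse it across requests
        if node_url not in self.drivers:
            self.drivers[node_url] = webdriver.Remote(
                command_executor=node_url,
                options=webdriver.ChromeOptions())
        return self.drivers[node_url]

    def process_request(self, request, spider):
        # The spider picks the instance by putting its URL in request.meta,
        # as suggested above; fall back to the first node otherwise
        node_url = request.meta.get('selenium_node', self.node_urls[0])
        driver = self._driver_for(node_url)
        driver.get(request.url)
        return HtmlResponse(driver.current_url,
                            body=driver.page_source,
                            encoding='utf-8',
                            request=request)

On the spider side, each Request would then carry the node it should use in its meta dict, for example assigned round-robin over the node list. Note that the blocking driver.get still runs inside process_request, so calls happen one at a time on the reactor thread; newer Scrapy versions can also wait on a Deferred returned from process_request, so wrapping the blocking call with twisted.internet.threads.deferToThread is one way to let several of these fetches actually overlap.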
