I have a number of different pages (let's say 50) to scrape. I am using Selenium WebDriver inside a Scrapy downloader middleware to download each page, and then processing each page in the classical Scrapy manner.
There should be a way to make this faster, i.e. not have to wait for the Selenium driver to download all 50 pages sequentially before processing them, but perhaps to introduce a multiprocessing pool or multiple Selenium drivers that download pages concurrently, as all the processing is done only once each page has been downloaded.
However, I am not sure how to do this, since the middleware takes a single request as input to its process_request method:
from scrapy.http import HtmlResponse

def process_request(self, request, spider):
    ...
    # Blocking call: nothing else gets downloaded while Selenium loads the page
    self.driver.get(request.url)
    ...
    return HtmlResponse(self.driver.current_url,
                        body=self.driver.page_source,
                        encoding='utf-8',
                        request=request)
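To make it concrete, the kind of change I have been imagining is a pool of drivers inside the middleware, along the lines of this rough, untested sketch (SeleniumPoolMiddleware and pool_size are my own names, and I do not know whether process_request is actually allowed to return a Deferred from deferToThread):

from queue import Queue

from scrapy.http import HtmlResponse
from selenium import webdriver
from twisted.internet.threads import deferToThread


class SeleniumPoolMiddleware:
    """Hypothetical middleware using a pool of drivers instead of one."""

    def __init__(self, pool_size=4):
        # Pool of drivers shared by all requests; size is a guess
        # (driver cleanup on spider close is omitted here)
        self.drivers = Queue()
        for _ in range(pool_size):
            self.drivers.put(webdriver.Firefox())

    def process_request(self, request, spider):
        # Push the blocking Selenium work onto a thread so that several
        # drivers can fetch pages at the same time. I am not certain
        # Scrapy accepts a Deferred as a return value here.
        return deferToThread(self._download, request)

    def _download(self, request):
        driver = self.drivers.get()  # borrow a driver (blocks if all are busy)
        try:
            driver.get(request.url)
            return HtmlResponse(driver.current_url,
                                body=driver.page_source,
                                encoding='utf-8',
                                request=request)
        finally:
            self.drivers.put(driver)  # hand the driver back to the pool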
In the part of the code that comes before the middleware, I have something like this:
for item in items:
    request = Request(url=...)
    yield request
Each of these requests is sent to the middleware sequentially, and I am not sure whether anything can be done at this point to introduce concurrency.
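From what I understand, Scrapy itself already feeds requests to the downloader concurrently (up to CONCURRENT_REQUESTS), so presumably it is the blocking driver.get() call that serializes everything. If the pool sketch above is workable, I assume the relevant settings would be something along these lines (the settings exist, but the values are my guesses):

# settings.py
CONCURRENT_REQUESTS = 4          # how many requests Scrapy keeps in flight at once
REACTOR_THREADPOOL_MAXSIZE = 4   # deferToThread runs in the reactor's thread pool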
What can be done to increase the speed of this task?