I have over 19,000 links that I need to visit and scrape data from. Each page takes about 5 seconds to fully load, which means I will need slightly more than 26 hours to scrape everything with a single webdriver.
To me, it seems the simplest solution is to start another webdriver (or a few more) in a separate Python notebook, each working through its own portion of the links in parallel, e.g.:
In first iPython notebook:
from selenium import webdriver
driver1 = webdriver.Firefox()
... scraping code looping over links 0-9500 using driver1...
In second iPython notebook:
from selenium import webdriver
driver2 = webdriver.Firefox()
... scraping code looping over links 9501-19000 using driver2...
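The manual split above can be generalized so each worker gets a near-equal slice of the full list. This is a minimal sketch; the `links` list of example URLs is a placeholder standing in for the real 19,000 URLs:

```python
def chunk(seq, n_workers):
    """Split seq into n_workers contiguous chunks of near-equal size."""
    k, rem = divmod(len(seq), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + k + (1 if i < rem else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks

# Placeholder URLs; in practice this is the real list of 19,000 links.
links = [f"https://example.com/page/{i}" for i in range(19000)]
parts = chunk(links, 2)
# parts[0] holds the first 9500 links, parts[1] the remaining 9500;
# each notebook/driver then loops over its own chunk.
```

With this, adding a third or fourth driver only means changing `n_workers` instead of recomputing index ranges by hand.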
I'm fairly new to scraping, so this question may be completely elementary/ridiculous(?). However, I've tried searching for this and haven't found anything on the topic, so I would appreciate any advice on the matter, or any recommendations for a better/more correct way to implement this.
I've heard of multi-threading using the thread module (http://www.tutorialspoint.com/python/python_multithreading.htm), but I wonder whether implementing it that way would have any advantage over simply creating multiple webdrivers as in the code above.
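For comparison, here is a rough sketch of the threading approach within a single notebook: a thread pool where each worker thread keeps one long-lived driver of its own, so drivers are reused across many URLs instead of one driver per notebook. `DriverStub` is a stand-in so the sketch runs without a browser; in real use it would be replaced by `webdriver.Firefox()` and `scrape` would parse `driver.page_source`:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class DriverStub:
    """Placeholder for selenium's webdriver.Firefox(), used so this
    sketch runs without launching a browser."""
    def get(self, url):
        return f"page source of {url}"

thread_local = threading.local()

def get_driver():
    # Each worker thread lazily creates and then reuses its own driver,
    # so you pay the browser startup cost once per thread, not per URL.
    if not hasattr(thread_local, "driver"):
        thread_local.driver = DriverStub()  # real use: webdriver.Firefox()
    return thread_local.driver

def scrape(url):
    driver = get_driver()
    return driver.get(url)  # real use: driver.get(url); parse driver.page_source

# Placeholder URLs; max_workers controls how many browsers run at once.
urls = [f"https://example.com/page/{i}" for i in range(10)]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(scrape, urls))
```

The practical difference from the two-notebook version is mostly operational: one process to supervise, results collected in one place, and the worker count is a single parameter rather than a count of open notebooks. The page-load wait dominates either way, so throughput should be similar for the same number of drivers.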