
I have over 19,000 links that I need to visit and scrape data from. Each takes about 5 seconds to fully load, so scraping everything with a single webdriver would take slightly more than 26 hours.
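For reference, the estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `estimated_hours` is made up for illustration, and the `n_drivers` division assumes the links split evenly and the drivers do not slow each other down, which real runs may not achieve.

```python
def estimated_hours(n_links, seconds_per_page, n_drivers=1):
    """Ideal wall-clock hours if the pages are split evenly across drivers."""
    total_seconds = n_links * seconds_per_page
    return total_seconds / n_drivers / 3600


# 19,000 links at about 5 seconds each, as stated in the question.
print(round(estimated_hours(19_000, 5), 1))               # 26.4 (one driver)
print(round(estimated_hours(19_000, 5, n_drivers=2), 1))  # 13.2 (two drivers)
```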

To me, it seems that a solution is simply to start another webdriver (or a few others) in a separate Python notebook that works through another portion of the links in parallel, i.e.:

In the first IPython notebook:

from selenium import webdriver
driver1 = webdriver.Firefox()
# ... scraping code looping over links 0-9500 using driver1 ...

In the second IPython notebook:

from selenium import webdriver
driver2 = webdriver.Firefox()
# ... scraping code looping over links 9501-19000 using driver2 ...

I'm fairly new to scraping, so this question may be completely elementary/ridiculous(?). However, I've tried searching for this and haven't seen anything on the topic, so I would appreciate any advice on this matter, or any recommendations for a better or more correct way to implement it.

I've heard of multi-threading using the thread module (http://www.tutorialspoint.com/python/python_multithreading.htm), but wonder whether implementing it in this manner would have any advantage over simply creating multiple webdrivers as in the aforementioned code.
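To make the comparison concrete, here is a minimal sketch of the threaded variant: one webdriver per thread, each working through its own slice of the links, all inside a single script instead of separate notebooks. It uses the standard-library `concurrent.futures` rather than the `thread` module mentioned above. The names `split_into_chunks`, `scrape_chunk`, `scrape_in_parallel`, `make_driver` and `scrape_page` are hypothetical and not part of Selenium; the only thing assumed about a driver object is that it has a `quit()` method.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous, nearly equal slices."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def scrape_chunk(links, make_driver, scrape_page):
    """Open one browser, visit every link in this chunk, then close it."""
    driver = make_driver()  # hypothetical factory, e.g. webdriver.Firefox
    try:
        return [scrape_page(driver, link) for link in links]
    finally:
        driver.quit()  # always release the browser, even on errors


def scrape_in_parallel(links, n_drivers, make_driver, scrape_page):
    """Run one driver per thread; results come back in the original order."""
    chunks = split_into_chunks(links, n_drivers)
    with ThreadPoolExecutor(max_workers=n_drivers) as pool:
        per_chunk = pool.map(
            lambda chunk: scrape_chunk(chunk, make_driver, scrape_page), chunks
        )
        return [row for rows in per_chunk for row in rows]
```

With Selenium, `make_driver` could simply be `webdriver.Firefox`, and `scrape_page` a function that calls `driver.get(link)` and returns whatever is extracted from the page. Functionally this is close to running several notebooks by hand: each thread spends most of its time waiting for its browser, so the main gain is likely convenience (one script, results collected in one list) rather than extra speed.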

1 Answer

Do you really need to use Selenium for this? Have a look at Scrapy: with this framework you can easily send a lot of requests and scrape the data. Selenium is mainly useful for browser automation.


1 Comment

Thanks - I have read the book, and was advised to use Selenium because the pages I want to get data from contain a lot of JavaScript, which has to be processed by a client-side browser.
