Python & web scraping performance

Question

I am trying to do some Python-based web scraping where execution time is critical.

I've tried PhantomJS, Selenium, and PyQt4 now, and all three have given me similar response times. I'd post example code, but my problem affects all three, so I believe the problem lies either in a shared dependency or outside of my code. At around 50 concurrent requests, we see a huge degradation in response time. It takes about 40 seconds to get back all 50 pages, and that time grows exponentially as the number of pages increases. Ideally I'm looking for ~200+ requests in about 10 seconds. I used multiprocessing to spawn each instance of PhantomJS/PyQt4/Selenium, so each URL request gets its own instance and I'm not blocked by single threading.

I don't believe it's a hardware bottleneck: it's running on 32 dedicated CPU cores, totaling 64 threads, and CPU usage doesn't typically spike above 10-12%. Bandwidth also sits comfortably at around 40-50% of my total throughput.

I've read about the GIL, which I believe I've addressed by using multiprocessing. Is web scraping just inherently slow? Should I stop expecting to pull 200ish webpages in ~10 seconds?

My overall question is: what is the best approach to high-performance web scraping, where evaluating js on the webpage is a requirement?

Scrapy, Splash, Frontera

– amarynets, Commented Sep 13, 2017 at 19:30

Irmen de Jong · Accepted Answer · 2017-09-13 19:26:39Z

"evaluating js on the webpage is a requirement" <- I think this is your problem right here. Simply downloading 50 web pages is fairly trivially parallelized and should take only as long as the slowest server takes to respond. Now, spawning 50 JavaScript engines in parallel (which I guess is essentially what you are doing) to run the scripts on every page is a different matter. Imagine firing up 50 Chrome browsers at the same time.
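To illustrate the "simply downloading" case, here is a minimal sketch using only the standard library. The names `fetch` and `fetch_all` and the `max_workers` value are my own assumptions for illustration, not anything from your setup, and no JavaScript is evaluated here. Because each worker spends nearly all its time waiting on the network, plain threads are enough and the total time should be close to that of the slowest response.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Download one page and return its raw bytes (no JavaScript is run)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=50):
    """Download all URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `urls` in its results.
        return list(pool.map(fetch_one, urls))
```

Calling `fetch_all(list_of_urls)` returns one response body per URL. The `fetch_one` parameter is there only so that a different downloader (or a stub for testing) can be swapped in.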

Anyway: profile and measure the parts of your application to find where the bottleneck lies. Only then can you see whether you're dealing with an I/O bottleneck (sounds unlikely), a CPU bottleneck (more likely), or a global lock somewhere that serializes things (also likely, but impossible to say without any code posted).
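One rough way to check for serialization, sketched below under stated assumptions: run the same batch of work one at a time and then concurrently, and compare wall-clock times. The helpers `wall_time` and `speedup`, and the two stand-in tasks, are hypothetical names made up for this example. It uses threads and `time.sleep` purely as placeholders for your real per-page work. A ratio near 1 suggests something is forcing the tasks to run one after another; a ratio near the worker count suggests they really do overlap.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def wall_time(task, args, workers):
    """Seconds needed to run task over all args with the given concurrency."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(task, args))  # consume results so any exception surfaces
    return time.perf_counter() - start


def speedup(task, args, workers):
    """Ratio of one-at-a-time wall time to concurrent wall time."""
    return wall_time(task, args, 1) / wall_time(task, args, workers)


_lock = threading.Lock()


def independent_task(_):
    time.sleep(0.05)  # stand-in for work that can overlap freely


def serialized_task(_):
    with _lock:  # stand-in for a hidden global lock
        time.sleep(0.05)
```

Comparing `speedup(independent_task, range(8), 8)` with `speedup(serialized_task, range(8), 8)` shows the difference: the first should come out well above 1, the second close to 1. For finding where the time goes inside a single task, the standard `cProfile` module is the usual next step.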
