I am scraping some websites that seem to have pretty good protection against it. The only way I can get it to work is to use Selenium to load the page and then scrape stuff from that.

Currently this works on my local computer (a Firefox window opens and closes when I access my page, and its HTML is processed further in my script). However, I need my scraper to be accessible on the web. The scraper is embedded in a Flask app on Heroku. Is there a way to make the Selenium browser work on Heroku servers? Or are there any hosting providers where it can work?

2 Answers

Heroku, wonderful as it is, has a major limitation: you cannot install custom software or, in many cases, libraries. To provide an easy-to-use, centrally controlled, managed stack, Heroku strips its servers down to prevent other usage.

What this boils down to is that there is no Xorg on a Heroku dyno. Without Xorg, and without the ability to install custom software, there is no xvfb either, so the browser that Selenium expects cannot run. Further, the browser itself is not generally available on a dyno.

You'll have better luck with a cloud offering like AWS, where you can install custom software, including Firefox, xvfb (to avoid the full Xorg overhead), and of course the rest of your scraping stack. This answer explains how to do it properly.
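As a rough illustration of the xvfb approach on a server where you can install packages, the sketch below wraps a scraper script in `xvfb-run` so that Firefox gets a virtual display. It assumes the `xvfb` package (which provides `xvfb-run`) is installed. The script name `scraper.py` and the helper names `xvfb_command` and `run_scraper` are hypothetical, chosen only for this example.

```python
import shutil
import subprocess
import sys


def xvfb_command(script, screen="1280x1024x24"):
    """Build an argv list that runs `script` under a virtual X display."""
    return [
        "xvfb-run",
        "-a",  # let xvfb-run pick a free display number
        f"--server-args=-screen 0 {screen}",  # virtual screen geometry and depth
        sys.executable,  # run the script with the current Python interpreter
        script,
    ]


def run_scraper(script):
    """Run the scraper under xvfb-run; fail early if xvfb-run is missing."""
    if shutil.which("xvfb-run") is None:
        raise RuntimeError("xvfb-run not found; install the xvfb package first")
    return subprocess.run(xvfb_command(script), check=True)


if __name__ == "__main__":
    # `scraper.py` is a hypothetical name for the Selenium script.
    print(" ".join(xvfb_command("scraper.py")))
```

Running the scraper as a child process keeps the display setup out of the Selenium code itself, so the same script still works unchanged on a desktop machine.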

There are buildpacks that make Selenium work on Heroku.

Add the following buildpacks:

1) heroku buildpacks:add https://github.com/kevinsawicki/heroku-buildpack-xvfb-google-chrome/
2) heroku buildpacks:add https://github.com/heroku/heroku-buildpack-chromedriver

Then set the Heroku stack to cedar-14, as shown below, because the xvfb buildpack works only with cedar-14 (here `stocksdata` is the app name).

heroku stack:set cedar-14 -a stocksdata

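Then point Selenium at the Google Chrome binary, as below:

from selenium import webdriver
from selenium.webdriver import ChromeOptions

options = ChromeOptions()
options.binary_location = "/app/.apt/usr/bin/google-chrome-stable"  # Chrome installed by the buildpack
driver = webdriver.Chrome(options=options)  # newer Selenium releases take `options=`, not `chrome_options=`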

1 Comment

Your advice worked for me. In my case I manually uploaded chromedriver into the application's bin/ directory and, for headless mode, used heroku buildpacks:add https://github.com/heroku/heroku-buildpack-google-chrome instead of the xvfb buildpack.
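For anyone trying the headless route mentioned in this comment, here is a minimal sketch of how the driver setup might look. The binary path is the one from the answer above. The `CHROME_BINARY` environment variable and the helper names are hypothetical, introduced only so the path can be overridden. The flags listed are commonly used Chrome switches for container environments, not something the answer or comment confirms as required.

```python
import os

# Path taken from the answer above; adjust if your buildpack installs elsewhere.
DEFAULT_CHROME = "/app/.apt/usr/bin/google-chrome-stable"


def chrome_binary(env=None):
    """Return the Chrome path, honouring a hypothetical CHROME_BINARY override."""
    env = os.environ if env is None else env
    return env.get("CHROME_BINARY", DEFAULT_CHROME)


def headless_flags():
    """Chrome switches often used when no display is available."""
    return ["--headless", "--no-sandbox", "--disable-dev-shm-usage"]


def make_driver():
    """Create a headless Chrome driver; Selenium is imported only when needed."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.binary_location = chrome_binary()
    for flag in headless_flags():
        options.add_argument(flag)
    return webdriver.Chrome(options=options)
```

Calling `make_driver()` inside the Flask request handler (and `driver.quit()` when done) would avoid keeping a browser process alive between requests, at the cost of a slower response.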
