
I am working on a project where I want to see the average karma of users on various subreddits on Reddit. To do that I am scraping users' karma, which is proving a bit difficult with the new Reddit structure.

I am not able to use PRAW as the karma figures there are not correct.

According to the page source of a user's profile, all I need is to find the following two variables: commentKarma and postKarma. Both of these variables are found under the "" section; see the example at view-source:https://www.reddit.com/user/loganb3171. However, when I use Selenium's page_source or BeautifulSoup, they do not show up.

I have been working on this problem for a couple of hours now and I am nowhere near a solution.

Any and all help is appreciated.

Neither of these snippets gives me the entire page source that you get when right-clicking and choosing "View page source":

source_var = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

source_var = driver.page_source
  • Please include the relevant code you have so far that is not working. That might help someone help you. Commented Jul 1, 2018 at 9:23
  • This isn't an exact solution, but rather a suggestion. I've scraped Reddit recently and found that the old version of the website is much simpler to scrape than the new one, which depends heavily on JavaScript. For example, the old version of the link you posted is old.reddit.com/user/loganb3171, and you can see the user's karma right beneath the name. I'll try to scrape it from the new site, but keep this in mind unless there's a specific reason you don't want to scrape the old site. Also remember to send headers when scraping, because Reddit hates bots. Commented Jul 1, 2018 at 9:24
  • Yeah, the problem is that I am sure they will force the new website on everyone soon, and I don't want my code to work for only a week or so, as this project will take around 6 months to complete. Commented Jul 1, 2018 at 9:26

1 Answer

Okay, so I see that you're using selenium from the snippet in the question. If that's the case, then there's no way to set request headers with the web driver. Reddit will know you are a bot.

If you only need the page source, you can use requests to fetch the page and then either open it with Selenium or parse it with BeautifulSoup:

from bs4 import BeautifulSoup
import requests

url = "https://www.reddit.com/user/loganb3171"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.text, 'html.parser')

print(soup.prettify())
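
Once the page text is in hand, the two values from the question could be pulled out with a regular expression. This is only a minimal sketch: it assumes the raw HTML contains JSON-style pairs such as "commentKarma":123, which the question implies but does not show, and extract_karma is a hypothetical helper name, not part of any library.

```python
import re
from typing import Dict, Optional

# Assumption: the raw HTML embeds the values as JSON-style pairs such as
# "commentKarma":123. The exact markup is not confirmed by the question.
KARMA_PATTERN = re.compile(r'"(commentKarma|postKarma)"\s*:\s*(-?\d+)')


def extract_karma(html: str) -> Dict[str, Optional[int]]:
    """Return the first commentKarma/postKarma values found, or None if absent."""
    found: Dict[str, Optional[int]] = {"commentKarma": None, "postKarma": None}
    for name, value in KARMA_PATTERN.findall(html):
        # Keep only the first occurrence of each name.
        if found[name] is None:
            found[name] = int(value)
    return found


if __name__ == "__main__":
    # Synthetic sample text, not real Reddit output.
    sample = '<script>{"user":{"commentKarma":42,"postKarma":-7}}</script>'
    print(extract_karma(sample))
```

With the snippet above, this would be called as extract_karma(page.text). If either value comes back as None, the assumed pattern does not match what the server actually returned and needs adjusting.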

6 Comments

Awesome, this seems to be working. I have a follow-up question. You say: "Okay, so I see that you're using selenium from the snippet in the question. If that's the case, then there's no way to set request headers with the web driver. Reddit will know you are a bot." Why is that?
@J.Doe This is because whenever you send a request to a server, the server reads the request headers to see who is sending the request and what client it came from. If you don't manually set the headers to look somewhat like a browser's, Reddit will assume the request came from a bot, which it did.
I see, but can't I just change the headers for Selenium? I also use Selenium to browse on my own.
Selenium uses a web driver to automate a web browser. With the Chrome web driver, it launches a separate, automation-controlled Chrome instance, which can optionally run headless. Selenium is aimed at automated testing of your own web applications, and its WebDriver API does not expose a way to set custom request headers, so you can't add them to the requests sent through the driver. You can get around this by using a proxy that adds headers, but that is less reliable and takes a lot more setup.
Thanks a bunch! Do you have a starting point for using a proxy? The reason is that I am also using Selenium as my own browser, since it makes it easier for me to switch my proxy IP to another country.
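
As a possible starting point for the proxy question: Chrome accepts command-line switches for a proxy server and for the User-Agent string, and Selenium can pass such switches through its Chrome options. The sketch below only builds the switch strings so that it runs without Selenium installed. build_chrome_arguments is a hypothetical helper, and the proxy address is a placeholder. Note that this covers the User-Agent header only, not arbitrary request headers.

```python
from typing import List, Optional


def build_chrome_arguments(user_agent: str, proxy: Optional[str] = None) -> List[str]:
    """Build Chrome command-line switches for a custom User-Agent and optional proxy."""
    args = [f"--user-agent={user_agent}"]
    if proxy:
        # Placeholder proxy address format, e.g. "http://127.0.0.1:8080".
        args.append(f"--proxy-server={proxy}")
    return args


if __name__ == "__main__":
    for arg in build_chrome_arguments("Mozilla/5.0", "http://127.0.0.1:8080"):
        print(arg)
    # With Selenium installed, the switches would be applied roughly like this:
    #   from selenium import webdriver
    #   options = webdriver.ChromeOptions()
    #   for arg in build_chrome_arguments("Mozilla/5.0", "http://127.0.0.1:8080"):
    #       options.add_argument(arg)
    #   driver = webdriver.Chrome(options=options)
```

Whether a custom User-Agent alone is enough to avoid being flagged as a bot is not something this sketch can guarantee.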