
I am trying to fetch the HTML of a website whose content is generated with JavaScript so that I can scrape it with BeautifulSoup. The problem is that the HTML I receive is the page as it was before rendering. This is the code I am running:

from bs4 import BeautifulSoup  # was missing; BeautifulSoup is used below
from requests_html import HTMLSession

url = 'https://www.edwarddan.com/projects'

this_session = HTMLSession()
response = this_session.get(url)
response.html.render()

print(response.text)
print("---------------------------------------------------------------")

soup = BeautifulSoup(response.text, "html.parser")
names = soup.findAll("div")

As a result, I get the following HTML:

<!doctype html>
<html lang="en">

<head>
  <meta charset="utf-8" />
  <meta name="robots" content="noindex" />
  <script src="/cdn-cgi/apps/head/cTVctQJ-rr0oH623j2V4Pf03v-o.js"></script>
  <link rel="icon" href="/favicon.ico" />
  <meta name="viewport" content="width=device-width,initial-scale=1" />
  <meta name="theme-color" content="#000000" />
  <meta name="description" content="Edward Dan -- Last updated 1/21/2021" />
  <link rel="apple-touch-icon" href="/logo192.png" />
  <link rel="manifest" href="/manifest.json" />
  <title>Edward Dan</title>
  <link href="/static/css/main.d231a676.chunk.css" rel="stylesheet">
</head>

<body><noscript>You need to enable JavaScript to run this app.</noscript>
  <div id="root"></div>
  <script>
    ! function(e) {
      function r(r) {
        for (var n, p, l = r[0], a = r[1], f = r[2], c = 0, s = []; c < l.length; c++) p = l[c], Object.prototype.hasOwnProperty.call(o, p) && o[p] && s.push(o[p][0]), o[p] = 0;
        for (n in a) Object.prototype.hasOwnProperty.call(a, n) && (e[n] = a[n]);
        for (i && i(r); s.length;) s.shift()();
        return u.push.apply(u, f || []), t()
      }

      function t() {
        for (var e, r = 0; r < u.length; r++) {
          for (var t = u[r], n = !0, l = 1; l < t.length; l++) {
            var a = t[l];
            0 !== o[a] && (n = !1)
          }
          n && (u.splice(r--, 1), e = p(p.s = t[0]))
        }
        return e
      }
      var n = {},
        o = {
          1: 0
        },
        u = [];

      function p(r) {
        if (n[r]) return n[r].exports;
        var t = n[r] = {
          i: r,
          l: !1,
          exports: {}
        };
        return e[r].call(t.exports, t, t.exports, p), t.l = !0, t.exports
      }
      p.m = e, p.c = n, p.d = function(e, r, t) {
        p.o(e, r) || Object.defineProperty(e, r, {
          enumerable: !0,
          get: t
        })
      }, p.r = function(e) {
        "undefined" != typeof Symbol && Symbol.toStringTag && Object.defineProperty(e, Symbol.toStringTag, {
          value: "Module"
        }), Object.defineProperty(e, "__esModule", {
          value: !0
        })
      }, p.t = function(e, r) {
        if (1 & r && (e = p(e)), 8 & r) return e;
        if (4 & r && "object" == typeof e && e && e.__esModule) return e;
        var t = Object.create(null);
        if (p.r(t), Object.defineProperty(t, "default", {
            enumerable: !0,
            value: e
          }), 2 & r && "string" != typeof e)
          for (var n in e) p.d(t, n, function(r) {
            return e[r]
          }.bind(null, n));
        return t
      }, p.n = function(e) {
        var r = e && e.__esModule ? function() {
          return e.default
        } : function() {
          return e
        };
        return p.d(r, "a", r), r
      }, p.o = function(e, r) {
        return Object.prototype.hasOwnProperty.call(e, r)
      }, p.p = "/";
      var l = this["webpackJsonpmy-app"] = this["webpackJsonpmy-app"] || [],
        a = l.push.bind(l);
      l.push = r, l = l.slice();
      for (var f = 0; f < l.length; f++) r(l[f]);
      var i = a;
      t()
    }([])
  </script>
  <script src="/static/js/2.c42d857a.chunk.js"></script>
  <script src="/static/js/main.60cdc4af.chunk.js"></script>
</body>

</html>

This HTML contains none of the elements that are visible once the website has finished rendering. Is there a way to wait for the website to finish running its JavaScript and create all its elements before fetching the HTML?

I am using HTMLSession because, from my research, it allows JavaScript-driven pages to load. Specifically, the line response.html.render() should render the page before the data is fetched; however, it doesn't seem to be working, as nothing has rendered.

I have also tried using Selenium in combination with PhantomJS; however, Selenium seems a bit outdated, and I would prefer not to use it.

Does anyone know if there is a way to wait for the JS to finish running with HTMLSession? If not, is there another library I can use that provides this functionality?

  • If you are not satisfied with selenium, then there are other libraries, for example: marionette (firefox), pyppeteer (chrome). You can also manually model and send requests until the page is completely loaded, but this sucks. Commented Jun 8, 2022 at 8:25

1 Answer

You are using the wrong attribute. response.text holds the HTML as it was originally downloaded; the official documentation suggests using response.html after rendering.

Replacement for your code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

url = 'https://www.edwarddan.com/projects'

this_session = HTMLSession()
response = this_session.get(url)
response.html.render()

print(response.html.raw_html)
print("---------------------------------------------------------------")

soup = BeautifulSoup(response.html.raw_html, "html.parser")
names = soup.findAll("div")

BeautifulSoup needs the HTML markup itself, so we pass it response.html.raw_html, which holds the rendered HTML as bytes.
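
If the rendered HTML still shows an empty root element, the page may simply need more time. As a minimal sketch, render() accepts sleep (seconds to pause after the initial render) and timeout arguments. The values below and the helper names fetch_rendered_html and looks_unrendered are illustrative assumptions, not part of requests_html, and the regex is only a rough heuristic for the empty root placeholder shown in the question.

```python
import re

# Matches the empty placeholder a client-rendered app ships before its JS runs.
EMPTY_ROOT = re.compile(r'<div\s+id="root"\s*>\s*</div>')


def looks_unrendered(html_text):
    """Heuristic: True if the #root placeholder is still empty."""
    return EMPTY_ROOT.search(html_text) is not None


def fetch_rendered_html(url, sleep=2, timeout=20):
    """Fetch url, let its JavaScript run, and return the resulting HTML as a str."""
    # Imported here so the heuristic above stays usable without requests_html.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        # sleep: seconds to wait after the initial render.
        # timeout: seconds before the render attempt gives up.
        response.html.render(sleep=sleep, timeout=timeout)
        return response.html.html
    finally:
        session.close()
```

Calling fetch_rendered_html(url) and checking the result with looks_unrendered() gives a quick signal for whether sleep needs to be raised before the HTML is handed to BeautifulSoup.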


It is also better to search with response.html.find() rather than with BeautifulSoup: the rendered document is already parsed, so this skips parsing it a second time and should be faster.

Example for your code:

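from requests_html import HTMLSession

url = 'https://www.edwarddan.com/projects'

this_session = HTMLSession()
response = this_session.get(url)
response.html.render()  # run the page's JavaScript before querying

# find() takes a CSS selector and returns a list of matching elements
print(response.html.find('div'))
print("---------------------------------------------------------------")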
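
Beyond a bare tag name, find() takes any CSS selector, and first=True returns a single element (or None) instead of a list. The sketch below assumes the rendered content sits under the #root element seen in the question's output; summarize_root is a hypothetical helper name, not something requests_html provides.

```python
def summarize_root(html):
    """Return (visible text, sorted absolute links) of the #root element.

    `html` is expected to be a requests_html HTML object, e.g. response.html
    after response.html.render() has been called.
    """
    root = html.find("#root", first=True)  # None when nothing matches
    if root is None:
        return "", []
    # absolute_links is a set, so sort it for a stable order.
    return root.text, sorted(root.absolute_links)
```

With the response from the example above, summarize_root(response.html) would return the page text and its links without a separate BeautifulSoup pass.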