
I'm looking at the source code of this link as it was first received by the browser. The problem is that the DOM is manipulated by JavaScript (for example, the calendar).

How can I get the page after it has loaded, so I can access the JavaScript-generated calendar?

I wish to get this result

<table class="table-bordered daily">

I've tried this code with no luck

import requests
from bs4 import BeautifulSoup

# requests returns the HTML exactly as the server sends it; no JavaScript is executed
page = requests.get('https://www.matchi.se/facilities/abybadminton?date=2020-10-17&sport=')
soup = BeautifulSoup(page.content, 'html.parser')

# the JavaScript-generated calendar table is not in this HTML
for table in soup.find_all('table'):
    print(table)
  • To get the code after the JavaScript manipulation, you need to run a JavaScript interpreter. You cannot do that with requests, because it just downloads the data and hands it to you as is (the same as what the browser receives, except the browser then runs the scripts). What you want is a library that implements a whole browser or drives an existing one. Commented Oct 17, 2020 at 0:09
  • There might be something simpler out there, but one solution that I know of is to use Selenium. It will open the page in Chrome and then return the rendered HTML to you. Selenium also lets you interact with Chrome (a minimal sketch follows these comments). Commented Oct 17, 2020 at 0:56
  • Selenium looks easier, as you can then use classes to add in colour should you make it a graphic. You can reconstruct the available/booked division from one of the XHR requests the page makes, but it is a bit of a faff. Commented Oct 17, 2020 at 2:09
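
A minimal sketch of the Selenium route suggested in the comments (this assumes Chrome and a matching chromedriver are installed and on the PATH; the table.table-bordered.daily selector is taken from the question):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes Chrome + chromedriver are available locally
try:
    driver.get('https://www.matchi.se/facilities/abybadminton?date=2020-10-17&sport=')
    # wait until the JavaScript-generated calendar table appears in the DOM
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'table.table-bordered.daily'))
    )
    # page_source now contains the HTML *after* the scripts have run
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for table in soup.find_all('table'):
        print(table)
finally:
    driver.quit()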

1 Answer


The page builds the calendar by making a request to an external URL via JavaScript. You can make the same request with requests to load this information directly:

import re
import requests
from bs4 import BeautifulSoup

date = '2020-10-17'

# load the main page to extract the IDs needed for the schedule request
main_url = 'https://www.matchi.se/facilities/abybadminton?date={date}&sport='
html_doc = requests.get(main_url.format(date=date)).text

# the sport and facility IDs are embedded in the page source
sport_id = re.search(r"var sport = '(.*?)'", html_doc).group(1)
facility_id = re.search(r'facilityId: "(.*?)"', html_doc).group(1)

# this is the URL the page's JavaScript calls to render the calendar
ajax_url = 'https://www.matchi.se/book/schedule'

params = {
    'wl': '',
    'facilityId': facility_id,
    'date': date,
    'sport': sport_id,
    'week': '',
    'year': ''
}

soup = BeautifulSoup(requests.get(ajax_url, params=params).content, 'html.parser')

# print occupied slots (each slot's title attribute contains HTML, so parse it again):
for td in soup.select('td.slot.red'):
    title = BeautifulSoup(td['title'], 'html.parser').get_text(strip=True, separator=' ')
    print(title)

Prints:

Booked Bana 1 11:00 - 12:00
Booked Bana 2 10:00 - 11:00
Booked Bana 2 11:00 - 12:00
Booked Bana 2 12:00 - 13:00
Booked Bana 3 12:00 - 13:00
Booked Bana 3 14:00 - 15:00
Booked Bana 4 11:00 - 12:00
Booked Bana 5 11:00 - 12:00
Booked Bana 5 14:00 - 15:00
Booked Bana 6 11:00 - 12:00
Booked Bana 6 12:00 - 13:00
Booked Bana 7 11:00 - 12:00
Booked Bana 7 12:00 - 13:00
Booked Bana 7 14:00 - 15:00
Booked Bana 7 15:00 - 16:00
Booked Bana 8 10:00 - 11:00
Booked Bana 9 14:00 - 15:00
Booked Bana 10 12:00 - 13:00
Booked Bana 10 15:00 - 16:00
Booked Bana 13 11:00 - 12:00
Booked Bana 14 10:00 - 11:00
Booked Bana 15 10:00 - 11:00
Booked Bana 15 18:00 - 19:00
Booked Bana 16 13:00 - 14:00

2 Comments

Works great! May I ask how you recognized that there was an external request to that specific URL?
@MasterSmack I looked in the Firefox developer tools -> Network tab. It lists all the requests the page makes.
