scraping dynamic JavaScript table with python

Question

I'm trying to scrape this website: https://madduxsports.com/college-basketball-lines.php
I'm very new to Python and scraping, and I believe the table on this page is generated with JavaScript.
I only want the first 7 columns. Here is what I've tried:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://madduxsports.com/college-basketball-lines.php")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
script_tags = soup.find_all("script")
print(script_tags)

This gets everything inside the <script> tags, which contain the table data, but I don't know how to extract the first 7 columns from it.

Thanks for the help

chitown88 · Accepted Answer · 2021-12-03 09:33:29Z

You could get it directly from the request the page makes (but you'll need to do a bit of manipulation of the HTML escape characters and so on). This gets you the same data as if we pulled it from the <script> tag. I can show you how to get it that way as well if you'd like, but this is a better way in my opinion.

import requests
import pandas as pd

url = 'https://madduxsports.com/newodds/v2/scheduler-ajax.php'
payload = {
    'timezone': 'America/New_York',
    'is_first_request': '0',
    'league_id': '4',
    'sport_id': '2',
    'period_id': '1',
}


jsonData = requests.post(url, data=payload).json()

# Everything above is just to get the data.
# jsonData is the same JSON you see in the <script> tag.


odds = jsonData['odds']
schedulers = jsonData['schedulers']

odds_df = pd.json_normalize(odds)
schedulers_df = pd.json_normalize(schedulers)

names_dict = {}
for each in odds:
    names_dict[each['id']] = each['name']

cols = []
for col in schedulers_df:
    for k, v in names_dict.items():
        col = col.replace(str(k),v)
        
    cols.append(col)

schedulers_df.columns = cols

cols = ['date', 'team_ids', 'team_names',
        'score.away_score', 'score.home_score', 'score.description',
        'opener.1.away', 'opener.1.home']

odds_cols = [x for x in schedulers_df.columns if ('1.away' in x or '1.home' in x) and ('class' not in x)]

df = schedulers_df[cols + odds_cols]

Output:

print(df)
                    date          team_ids  ... odds.SIA.1.away odds.SIA.1.home
0    2021-12-03 00:00:00  306123<br>306124  ...     143&frac12;      -1&frac12;
1    2021-12-03 00:00:00  306127<br>306128  ...     142&frac12;              11
2    2021-12-03 00:00:00  306129<br>306130  ...  126&frac12;u12      -5&frac12;
3    2021-12-03 00:00:00  306131<br>306132  ...              17     146&frac12;
4    2021-12-03 01:00:00  306133<br>306134  ...      -2&frac12;     135&frac12;
..                   ...               ...  ...             ...             ...
107  2021-12-04 07:50:00  396155<br>396156  ...                                
108  2021-12-04 07:50:00  396157<br>396158  ...                                
109  2021-12-04 07:50:00  396159<br>396160  ...                                
110  2021-12-04 07:50:00      9875<br>9876  ...                                
111  2021-12-04 07:50:00      9877<br>9878  ...                                

[112 rows x 22 columns]
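
The output above still contains HTML entities such as &frac12; and <br> separators, which is the "manipulation of the HTML escape characters" mentioned earlier. A minimal sketch of one way to clean a value, using only the standard library; clean_cell is a hypothetical helper name, and " / " is an arbitrary choice of separator:

```python
import html

def clean_cell(value):
    """Decode HTML entities and replace <br> separators; pass non-strings through."""
    if not isinstance(value, str):
        return value
    return html.unescape(value).replace("<br>", " / ")

# One row copied from the printed output above.
row = {
    "team_ids": "306123<br>306124",
    "odds.SIA.1.away": "143&frac12;",
    "odds.SIA.1.home": "-1&frac12;",
}

cleaned = {key: clean_cell(val) for key, val in row.items()}
print(cleaned)
```

To apply the same helper to the whole DataFrame, something like df.apply(lambda col: col.map(clean_cell)) should work.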

edited Dec 3, 2021 at 9:33

answered Dec 1, 2021 at 17:26


7 Comments

e417927 Over a year ago

Thank you so much for the answer, but when I run this there seems to be a lot of data, and the majority of it is empty brackets or NaN. Am I doing something wrong?

chitown88 Over a year ago

Nothing wrong. That's just what's there. You're viewing all the columns though, right? There are like 300+ columns, so I wouldn't be surprised that it's sparse.

e417927 Over a year ago

Ok thank you, you're probably right. Quick question: why are there 300+ columns when the actual table only has about 11?

chitown88 Over a year ago

It includes a lot of metadata: things like id, group_type, etc. that might be needed to actually render the table on the site. The table on the website is rendered from this data, but obviously they choose to display only the parts deemed important. Secondly, the data is in a JSON format, so it needs to be "flattened" out. When you do that, it tends to expand and become very wide, so that each row/instance contains all the relevant info.
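
To illustrate the widening described above, here is a small sketch with a made-up record. The keys only loosely imitate the real payload (the book ids "101" and "102" and the "class" field are assumptions), but it shows how pd.json_normalize turns every nested leaf into its own dotted column:

```python
import pandas as pd

# Hypothetical record, loosely shaped like one entry of jsonData['schedulers'];
# the real payload has many more keys.
records = [
    {
        "date": "2021-12-03 00:00:00",
        "team_names": "Team A<br>Team B",
        "score": {"away_score": "", "home_score": "", "description": ""},
        "odds": {
            "101": {"1": {"away": "143&frac12;", "home": "-1&frac12;", "class": "x"}},
            "102": {"1": {"away": "142&frac12;", "home": "11", "class": "y"}},
        },
    }
]

flat = pd.json_normalize(records)
print(flat.shape)          # one row, one column per nested leaf
print(sorted(flat.columns))
```

Two top-level fields, three score fields and two sportsbooks with three fields each already give 11 columns for a single game, so a few dozen books and periods quickly add up to hundreds.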

e417927 Over a year ago

Is there a way to filter out all that metadata?
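
One possible approach, sketched on a small stand-in frame: keep an explicit list of the columns you care about plus the odds columns matched by name, much like the cols + odds_cols selection in the answer. The column names below are assumptions for illustration (id and group_type come from the comment above; "odds.SIA.1.away_class" is invented):

```python
import pandas as pd

# Hypothetical flattened frame standing in for schedulers_df.
flat = pd.DataFrame([{
    "date": "2021-12-03 00:00:00",
    "team_names": "Team A<br>Team B",
    "id": 1,
    "group_type": "game",
    "odds.SIA.1.away": "143&frac12;",
    "odds.SIA.1.home": "-1&frac12;",
    "odds.SIA.1.away_class": "up",
}])

wanted = ["date", "team_names"]
# Keep only the period-1 away/home odds; this skips the *_class metadata columns.
odds_cols = [c for c in flat.columns if c.endswith((".1.away", ".1.home"))]

slim = flat[wanted + odds_cols]
# Drop any column that is entirely NaN.
slim = slim.dropna(axis=1, how="all")
print(list(slim.columns))
```

If only the first seven columns of the result are needed, slim.iloc[:, :7] selects them by position.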
