I'm trying to scrape this website: https://madduxsports.com/college-basketball-lines.php
I'm very new to Python and scraping. I believe the table on this website is generated with JavaScript.
I'm looking to scrape just the first 7 columns. Here is what I've tried:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://madduxsports.com/college-basketball-lines.php")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
script_tags = soup.find_all("script")
print(script_tags)

This gets everything in the <script> tags, which includes the table data, but I don't know how to get the first 7 columns from it.

Thanks for the help

1 Answer

You can get the data directly from the request the page makes, although you'll need to do a bit of manipulation of the HTML escape characters and so on. This gets you the same data as pulling it from the <script> tag. I can show you how to get it that way as well if you'd like, but this is the better way in my opinion.

import requests
import pandas as pd

url = 'https://madduxsports.com/newodds/v2/scheduler-ajax.php'
payload = {
'timezone': 'America/New_York',
'is_first_request': '0',
'league_id': '4',
'sport_id': '2',
'period_id': '1'}


jsonData = requests.post(url, data=payload).json()

# Everything above is just to get the data
# jsonData is the json you see in the <script> tag


odds = jsonData['odds']
schedulers = jsonData['schedulers']

odds_df = pd.json_normalize(odds)
schedulers_df = pd.json_normalize(schedulers)

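# Map each id in `odds` to its display name (e.g. 'SIA')
names_dict = {}
for each in odds:
    names_dict[each['id']] = each['name']

# Swap the ids embedded in the flattened column names for those names
cols = []
for col in schedulers_df:
    for k, v in names_dict.items():
        col = col.replace(str(k), v)
    cols.append(col)

schedulers_df.columns = cols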

cols = ['date', 'team_ids', 'team_names', 'score.away_score', 'score.home_score',
        'score.description', 'opener.1.away', 'opener.1.home']

odds_cols = [x for x in schedulers_df.columns if ('1.away' in x or '1.home' in x) and ('class' not in x)]

df = schedulers_df[cols + odds_cols]

Output:

print(df)
                    date          team_ids  ... odds.SIA.1.away odds.SIA.1.home
0    2021-12-03 00:00:00  306123<br>306124  ...     143&frac12;      -1&frac12;
1    2021-12-03 00:00:00  306127<br>306128  ...     142&frac12;              11
2    2021-12-03 00:00:00  306129<br>306130  ...  126&frac12;u12      -5&frac12;
3    2021-12-03 00:00:00  306131<br>306132  ...              17     146&frac12;
4    2021-12-03 01:00:00  306133<br>306134  ...      -2&frac12;     135&frac12;
..                   ...               ...  ...             ...             ...
107  2021-12-04 07:50:00  396155<br>396156  ...                                
108  2021-12-04 07:50:00  396157<br>396158  ...                                
109  2021-12-04 07:50:00  396159<br>396160  ...                                
110  2021-12-04 07:50:00      9875<br>9876  ...                                
111  2021-12-04 07:50:00      9877<br>9878  ...                                

[112 rows x 22 columns]
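
As for the HTML escape characters mentioned earlier: the cells still contain entities such as &frac12; and <br> separators between the two values. Here is a minimal sketch of one way to clean them, using only the standard library. The helper name clean_cell and the choice to turn ½ into .5 are my own assumptions, not something the site or the code above prescribes.

```python
import html


def clean_cell(value):
    """Split a '<br>'-joined cell and decode its HTML entities.

    Hypothetical helper: returns a list of cleaned strings, or the
    value unchanged when it is not a string (e.g. NaN).
    """
    if not isinstance(value, str):
        return value
    parts = []
    for part in value.split("<br>"):
        text = html.unescape(part).strip()        # '&frac12;' becomes '½'
        parts.append(text.replace("\u00bd", ".5"))  # '½' becomes '.5' (my choice)
    return parts


# Sample values copied from the output above
samples = ["306123<br>306124", "143&frac12;", "-1&frac12;", "126&frac12;u12"]
for raw in samples:
    print(raw, "->", clean_cell(raw))
```

If that shape suits your data, something like df['team_ids'].map(clean_cell) should apply it to a single column.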

7 Comments

Thank you so much for the answer, but when I run this it seems that there is a lot of data, and the majority of it is empty brackets or NaN. Am I doing something wrong?
Nothing wrong. That's just what's there. You're viewing all the columns though, right? There are like 300+ columns, so I wouldn't be surprised that it's sparse.
OK, thank you, you're probably right. Quick question: why are there 300+ columns when the actual table only has about 11?
It includes a lot of the metadata: things like id, group_type, etc. that might be needed to actually render the table on the site. The table on the website is rendered from this data, but obviously they choose to display only the parts deemed important. Secondly, the data is in JSON format, so it needs to be "flattened" out. When you do that, it tends to expand and become very wide, so that each row/instance contains all the relevant info.
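
To illustrate why flattening makes the frame so wide, here is a rough sketch with a made-up record. The flatten helper, the ids 101 and 102, and the field values are all hypothetical. pd.json_normalize does something comparable for nested dicts, joining the keys with a dot by default.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one dict with dotted keys (lists are left as-is)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical record, shaped loosely like one entry of jsonData['schedulers']
game = {
    "date": "2021-12-03 00:00:00",
    "score": {"away_score": "", "home_score": ""},
    "odds": {
        "101": {"1": {"away": "143&frac12;", "home": "-1&frac12;", "class": "x"}},
        "102": {"1": {"away": "142&frac12;", "home": "11", "class": "y"}},
    },
}

row = flatten(game)
print(len(game), "top-level keys become", len(row), "flat columns:")
for name in row:
    print(" ", name)
```

Every nested leaf becomes its own column, so each additional id under odds adds several columns, and any game that has no value for that id is left empty there.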
Is there a way to filter out all that metadata?
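
One possible way to filter, as a sketch: keep a short list of exact column names plus any column whose name ends in a pattern you care about, then drop the columns that turn out to be entirely empty. The frame below is a made-up stand-in for schedulers_df, and every column name other than the ones shown in the answer is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the flattened schedulers_df from the answer
schedulers_df = pd.DataFrame({
    "date": ["2021-12-03 00:00:00"],
    "team_names": ["Team A<br>Team B"],
    "id": [1],                           # metadata
    "group_type": ["x"],                 # metadata
    "odds.SIA.1.away": ["143&frac12;"],
    "odds.SIA.1.home": ["-1&frac12;"],
    "odds.SIA.1.away_class": ["up"],     # assumed name for a styling column
    "odds.XYZ.1.away": [""],             # assumed column with no data
})

keep_exact = ["date", "team_names"]
keep_suffixes = (".1.away", ".1.home")

# Keep the named columns plus anything ending in one of the suffixes
keep = [
    col for col in schedulers_df.columns
    if col in keep_exact or col.endswith(keep_suffixes)
]
slim = schedulers_df[keep]

# Treat empty strings as missing, then drop columns with no data at all
slim = slim.replace("", pd.NA).dropna(axis=1, how="all")
print(list(slim.columns))
```

Matching on the end of the name rather than with `in` avoids picking up columns that merely contain the pattern somewhere in the middle.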