I am trying to pull the table data from this website: 'https://understat.com/league/EPL'. When I viewed the page source, I found that the table data is stored in a <script> tag. I want to know how to extract the data from the script in a usable format.
I tried using the solution from a similar question (How to Get Script Tag Variables From a Website using Python):
import requests
import bs4
import json

url = 'https://understat.com/league/EPL'
r = requests.get(url)
bs = bs4.BeautifulSoup(r.text, "html.parser")
scripts = bs.find_all('script')
for s in scripts:
    if 'var datesData' in s.text:
        script = s.text
        print(script)
However, nothing gets printed; that is, the check for 'var datesData' never matches any script. But when I just print(scripts), I get:
[<script>
var THEME = localStorage.getItem("theme") || 'DARK';
document.body.className = "theme-" + THEME.toLowerCase();
</script>,
<script>
var datesData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x2211643\x22,\x22isResult\x22\x3Atrue,\x22h\x22\x3A\x7B\x22id\x22\x3A\x2287\x22,\x22title\x22\x3A\x22Liverpool\x22,\x22short_title\x22\x3A\x22LIV\x22\x7D,\x22a\x22\x3A\x7B\x22id\x22\x3A\x2279\x22,\x22title\x22\x3A\x22Norwich\x22,\x22short_title\x22\x3A\x22NOR...
and so on
]
As you can see, the second element of the list contains 'var datesData', but my code won't print it.
What I want is to get that second script from the list and extract the data inside JSON.parse() so that I can eventually create a dataframe. One option is to copy that entire line from the page source and pass it to json.loads(), like this:
js = json.loads('\x5B\x7B\x22id\x22\x3A\x2211643\x22,\x22isResult\x22\x3Atrue,\x22h\x22\...')
which gives me an output of:
[{'id': '11643',
'isResult': True,
'h': {'id': '87', 'title': 'Liverpool', 'short_title': 'LIV'},
'a': {'id': '79', 'title': 'Norwich', 'short_title': 'NOR'},
'goals': {'h': '4', 'a': '1'},
'xG': {'h': '2.23456', 'a': '0.842407'},
'datetime': '2019-08-09 20:00:00',
'forecast': {'w': '0.7377', 'd': '0.1732', 'l': '0.0891'}},
{'id': '11644',
'isResult': True,
'h': {'id': '81', 'title': 'West Ham', 'short_title': 'WHU'},
'a': {'id': '88', 'title': 'Manchester City', 'short_title': 'MCI'},
'goals': {'h': '0', 'a': '5'},
'xG': {'h': '1.2003', 'a': '3.18377'},
'datetime': '2019-08-10 12:30:00',
'forecast': {'w': '0.0452', 'd': '0.1166', 'l': '0.8382'}},
{'id': '11645',
'isResult': True,
...
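As far as I can tell, the pasted version works only because Python itself decodes the \xNN escapes when it reads the string literal, so json.loads() receives ordinary JSON. Text scraped from the page would instead contain the backslashes as literal characters. A minimal sketch of that difference, using a shortened made-up payload (the names `literal` and `scraped` are only for illustration):

```python
import json

# Pasted into source code: Python turns each \xNN into the real character,
# so this string is already plain JSON text.
literal = '\x5B\x7B\x22id\x22\x3A\x2211643\x22\x7D\x5D'

# A raw string mimics what scraped page text looks like:
# the backslashes are literal characters, nothing has been decoded.
scraped = r'\x5B\x7B\x22id\x22\x3A\x2211643\x22\x7D\x5D'

parsed_literal = json.loads(literal)

try:
    json.loads(scraped)
    scraped_parsed_ok = True
except json.JSONDecodeError:
    # The text starts with a backslash, which is not valid JSON.
    scraped_parsed_ok = False

print(parsed_literal)
print(scraped_parsed_ok)
```

So any approach that reads the script from the page presumably needs an explicit decoding step before json.loads().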
However, a better way would be to fetch the data from the website directly, so I can account for changes that WILL happen to the data later.
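Something along these lines is what I have in mind. This is only a sketch under assumptions: the helper names `decode_hex_escapes` and `extract_json_parse_payload` are hypothetical, `sample_script` is a shortened stand-in for the real script text, and I am assuming the payload uses only \xNN escapes and never contains an unescaped single quote.

```python
import json
import re


def decode_hex_escapes(js_literal):
    """Replace each literal \\xNN sequence with the character it encodes."""
    return re.sub(
        r'\\x([0-9A-Fa-f]{2})',
        lambda m: chr(int(m.group(1), 16)),
        js_literal,
    )


def extract_json_parse_payload(script_text, var_name):
    """Return the parsed data from `var <name> = JSON.parse('...')`, or None."""
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*JSON\.parse\('(.*?)'\)"
    match = re.search(pattern, script_text, re.DOTALL)
    if match is None:
        return None
    # Decode the escapes first, then hand the resulting JSON text to json.loads().
    return json.loads(decode_hex_escapes(match.group(1)))


# Shortened stand-in for the script contents; the raw string keeps the
# backslashes literal, as they would be in text read from the page.
sample_script = (
    r"var datesData = JSON.parse("
    r"'\x5B\x7B\x22id\x22\x3A\x2211643\x22,\x22isResult\x22\x3Atrue\x7D\x5D');"
)

data = extract_json_parse_payload(sample_script, 'datesData')
print(data)
```

With the real page, `sample_script` would be replaced by the text of the matching script, which is exactly the part I cannot get hold of with the loop above.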
TLDR: I want to read the data stored in a script tag in a readable format using Python