I am trying to pull the table data from this website: 'https://understat.com/league/EPL'. When I viewed the source code, I found that the table is stored in a <script> tag. I want to know how to extract the data from that script in a usable format.

I tried using the solution from a similar question (How to Get Script Tag Variables From a Website using Python):

    import requests
    import bs4
    import json

    url = 'https://understat.com/league/EPL'
    r = requests.get(url)

    bs = bs4.BeautifulSoup(r.text, "html.parser")
    scripts = bs.find_all('script')

    for s in scripts:
        if 'var datesData' in s.text:
            script = s.text
            print(script)

However, nothing gets printed; that is, the check for 'var datesData' never matches any script. Yet when I just print(scripts), I get:

[<script>
            var THEME = localStorage.getItem("theme") || 'DARK';
            document.body.className = "theme-" + THEME.toLowerCase();
        </script>,
 <script>
    var datesData   = JSON.parse('\x5B\x7B\x22id\x22\x3A\x2211643\x22,\x22isResult\x22\x3Atrue,\x22h\x22\x3A\x7B\x22id\x22\x3A\x2287\x22,\x22title\x22\x3A\x22Liverpool\x22,\x22short_title\x22\x3A\x22LIV\x22\x7D,\x22a\x22\x3A\x7B\x22id\x22\x3A\x2279\x22,\x22title\x22\x3A\x22Norwich\x22,\x22short_title\x22\x3A\x22NOR...


and so on
]

As you can see, the second element of the list contains 'var datesData', but my code won't print it.

What I want is to get that second script from the list and extract the data inside JSON.parse() so I can eventually build a dataframe. One option is to copy that entire line from the page's source code and pass it to json.loads(), like this:

js = json.loads('\x5B\x7B\x22id\x22\x3A\x2211643\x22,\x22isResult\x22\x3Atrue,\x22h\x22\...')

which gives me an output of:

[{'id': '11643',
  'isResult': True,
  'h': {'id': '87', 'title': 'Liverpool', 'short_title': 'LIV'},
  'a': {'id': '79', 'title': 'Norwich', 'short_title': 'NOR'},
  'goals': {'h': '4', 'a': '1'},
  'xG': {'h': '2.23456', 'a': '0.842407'},
  'datetime': '2019-08-09 20:00:00',
  'forecast': {'w': '0.7377', 'd': '0.1732', 'l': '0.0891'}},
 {'id': '11644',
  'isResult': True,
  'h': {'id': '81', 'title': 'West Ham', 'short_title': 'WHU'},
  'a': {'id': '88', 'title': 'Manchester City', 'short_title': 'MCI'},
  'goals': {'h': '0', 'a': '5'},
  'xG': {'h': '1.2003', 'a': '3.18377'},
  'datetime': '2019-08-10 12:30:00',
  'forecast': {'w': '0.0452', 'd': '0.1166', 'l': '0.8382'}},
 {'id': '11645',
  'isResult': True,
...

However, the better way is to call the data from the website so I can account for changes that WILL happen later to the data.
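
One detail worth keeping in mind when moving from the pasted string to fetched text: the pasted version works because Python itself resolves the \xNN escapes while reading the string literal, whereas text downloaded at run time still contains real backslashes. The sketch below illustrates the difference offline. The short sample string is a hypothetical stand-in for the page content, and it assumes the page only uses \xNN escapes like the ones shown above.

```python
import ast
import json

# Pasted into source code: Python decodes \x5B to "[" while parsing the
# literal, so json.loads already receives valid JSON.
pasted = '\x5B\x7B\x22id\x22\x3A\x2211643\x22\x7D\x5D'
from_pasted = json.loads(pasted)

# Fetched at run time: the backslashes are ordinary characters. A raw string
# stands in for the downloaded text here, including its surrounding quotes.
fetched = r"'\x5B\x7B\x22id\x22\x3A\x2211643\x22\x7D\x5D'"

try:
    json.loads(fetched)
    raw_parse_ok = True
except json.JSONDecodeError:
    # The undecoded text is not valid JSON.
    raw_parse_ok = False

# ast.literal_eval evaluates the quoted text as a Python string literal,
# which resolves the escapes without executing arbitrary code.
from_fetched = json.loads(ast.literal_eval(fetched))

print(from_pasted)
print(raw_parse_ok)
print(from_fetched == from_pasted)
```

So any approach that reads the script from the live page probably needs an explicit decoding step before json.loads().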

TLDR: I want to read the data stored in a script tag in a readable format using Python

  • As a debugging step, print s.text. Commented Jun 17, 2020 at 14:32

1 Answer

Perhaps something like

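import ast
import json
import re
from pprint import pprint

import requests

# Capture the quoted string literal passed to JSON.parse() for datesData.
pattern = re.compile(r'\bvar\s+datesData\s*=\s*JSON\.parse\((.+?)\)')

url = 'https://understat.com/league/EPL'

r = requests.get(url)
r.raise_for_status()  # fail early on an HTTP error
m = pattern.search(r.text)
if m is None:
    raise ValueError('datesData not found in the page source')

# group(1) still carries its quotes and \xNN escapes; literal_eval turns it
# into a plain string, which json.loads can then parse.
data = ast.literal_eval(m.group(1))
o = json.loads(data)
pprint(o[:3])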

which gives me

[{'a': {'id': '79', 'short_title': 'NOR', 'title': 'Norwich'},
  'datetime': '2019-08-09 20:00:00',
  'forecast': {'d': '0.1732', 'l': '0.0891', 'w': '0.7377'},
  'goals': {'a': '1', 'h': '4'},
  'h': {'id': '87', 'short_title': 'LIV', 'title': 'Liverpool'},
  'id': '11643',
  'isResult': True,
  'xG': {'a': '0.842407', 'h': '2.23456'}},
 {'a': {'id': '88', 'short_title': 'MCI', 'title': 'Manchester City'},
  'datetime': '2019-08-10 12:30:00',
  'forecast': {'d': '0.1166', 'l': '0.8382', 'w': '0.0452'},
  'goals': {'a': '5', 'h': '0'},
  'h': {'id': '81', 'short_title': 'WHU', 'title': 'West Ham'},
  'id': '11644',
  'isResult': True,
  'xG': {'a': '3.18377', 'h': '1.2003'}},
 {'a': {'id': '238', 'short_title': 'SHE', 'title': 'Sheffield United'},
  'datetime': '2019-08-10 15:00:00',
  'forecast': {'d': '0.3923', 'l': '0.3994', 'w': '0.2083'},
  'goals': {'a': '1', 'h': '1'},
  'h': {'id': '73', 'short_title': 'BOU', 'title': 'Bournemouth'},
  'id': '11645',
  'isResult': True,
  'xG': {'a': '1.59864', 'h': '1.34099'}}]
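
Since the stated goal is a dataframe, the nested dictionaries probably need flattening first. The following is a minimal sketch of one way to do that with the standard library only. The `flatten` helper and the underscore-joined column names are illustrative assumptions, not part of the site's data or of any library, and the sample record is copied from the output above.

```python
def flatten(record, parent_key="", sep="_"):
    """Turn nested dicts into one flat dict, e.g. {'h': {'title': ...}} -> {'h_title': ...}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, prefixing keys with the parent name.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# One record copied from the output above, standing in for the parsed list `o`.
matches = [
    {'a': {'id': '79', 'short_title': 'NOR', 'title': 'Norwich'},
     'datetime': '2019-08-09 20:00:00',
     'forecast': {'d': '0.1732', 'l': '0.0891', 'w': '0.7377'},
     'goals': {'a': '1', 'h': '4'},
     'h': {'id': '87', 'short_title': 'LIV', 'title': 'Liverpool'},
     'id': '11643',
     'isResult': True,
     'xG': {'a': '0.842407', 'h': '2.23456'}},
]

rows = [flatten(match) for match in matches]
print(rows[0]["h_title"], rows[0]["goals_h"], rows[0]["xG_h"])
```

If pandas is installed, a list of flat dicts like `rows` can be passed to `pandas.DataFrame(rows)`, and `pandas.json_normalize(o)` is another option that handles the nesting itself. Note that the numeric fields arrive as strings, so they would still need converting (for example with `float()`) before doing arithmetic.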