I am trying to scrape the news headlines from this page. It appears that the headlines are contained in a JSON object named App inside a pair of script tags. If you're reading this in the future, you can assume it looked something like this:

    string = '''{"page":{"lang":"en","error":{"state":false,"type":null}},"system":{"referrer":null,"cookie":[],"params":{"get":[],"post":[]}},"components":{"search-fast-links":[{"name":"FY 2022 preliminary financial results","link":"\/en\/investors-and-media\/news\/press-releases\/08-02-2023\/","detail":""},{"name":"Re-domiciliation Q&A","link":"\/en\/investors-and-media\/shareholder-centre\/current-qa\/","detail":""}],"press-release":{"items":[{"name":"Q4 and FY 2023 production results","date":1706648400,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/31-01-2024\/?","theme":["Production results"],"files":[[{"name":"2024_01_31_Q4_Production_results_eng","type":"pdf","size":"402.15 Kb","link":"\/upload\/ib\/1\/24-01-31\/2024_01_31_Q4_Production_results_eng.pdf"}]]},{"name":"Notice regarding a change of a major shareholder","date":1706475600,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/29-01-2024\/?","theme":["Regulatory disclosures","Shareholder information"],"files":[[{"name":"2024_01_29_Notice regarding_a_change_of_a_major_shareholder_eng","type":"pdf","size":"279.34 Kb","link":"\/upload\/ib\/1\/24-01-29\/2024_01_29_Notice regarding_a_change_of_a_major_shareholder_eng.pdf"}]]},{"name":"Nominated brokers for the purpose of the Exchange Offer","date":1705525200,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/18-01-2024\/?","theme":["Shareholder information"],"files":[[{"name":"2024_01_18_Nominated_brokers_eng","type":"pdf","size":"202.73 Kb","link":"\/upload\/ib\/1\/24-01-18\/2024_01_18_Nominated_brokers_eng.pdf"}]]},{"name":"Total Voting Rights as at 29 December 2023 
","date":1703797200,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/29-12-2023\/?","theme":["Regulatory disclosures"],"files":[[{"name":"2023_12_29_TVR_eng","type":"pdf","size":"114.49 Kb","link":"\/upload\/ib\/1\/23-12-29\/2023_12_29_TVR_eng.pdf"}]]},{"name":"Receives the most prestigious corporate social responsibility award in the Republic of Kazakhstan","date":1702414800,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/13-12-2023\/?","theme":["ESG","Other"],"files":[[{"name":"2023_12_13_Paryz_award_eng","type":"pdf","size":"200.72 Kb","link":"\/upload\/ib\/1\/23-12-12\/2023_12_13_Paryz_award_eng.pdf"}]]},{"name":"Results of GM","date":1702242000,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/11-12-2023\/?","theme":["Shareholder information"],"files":[[{"name":"2023_12_11_GM_results_eng","type":"pdf","size":"215.86 Kb","link":"\/upload\/ib\/1\/23-12-10\/2023_12_11_GM_results_eng.pdf"}]]},{"name":"Offer to exchange certain shares currently affected by the EU asset freeze on NSD and Notice of General Meeting","date":1700686800,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/23-11-2023\/?","theme":["Shareholder information"],"files":[[{"name":"2023_11_23_Exchange_offer_GM_eng","type":"pdf","size":"241.23 Kb","link":"\/upload\/ib\/1\/23-11-23\/2023_11_23_Exchange_offer_GM_eng.pdf"}]]},{"name":"Q3 2023 production results","date":1698699600,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/31-10-2023\/?","theme":["Production results"],"files":[[{"name":"2023_10_31_Q3_Production_results_eng","type":"pdf","size":"404.46 
Kb","link":"\/upload\/ib\/1\/23-10-31\/2023_10_31_Q3_Production_results_eng.pdf"}]]},{"name":"Results of new share issues ","date":1696971600,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/11-10-2023-c\/?","theme":["Other","Regulatory disclosures","Shareholder information"],"files":[[{"name":"2023_10_11_Results_of_new_share_issues_eng","type":"pdf","size":"124.73 Kb","link":"\/upload\/ib\/1\/23-10-11\/2023_10_11_Results_of_new_share_issues_eng.pdf"}]]},{"name":"Completion of Exchange Offer","date":1696971600,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/11-10-2023-b\/?","theme":["Other","Shareholder information"],"files":[[{"name":"2023_10_11_Results_of_Exchange_Offer_eng","type":"pdf","size":"215.34 Kb","link":"\/upload\/ib\/1\/23-10-11\/2023_10_11_Results_of_Exchange_Offer_eng.pdf"}]]},{"name":"Director\/PDMR Shareholding","date":1696971600,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/11-10-2023-a\/?","theme":["Regulatory disclosures"],"files":[[{"name":"2023_10_11_PDMR_Notification_eng","type":"pdf","size":"131.9 Kb","link":"\/upload\/ib\/1\/23-10-11\/2023_10_11_PDMR_Notification_eng.pdf"}]]},{"name":"Half-year report for the six month ended 30 June 2023","date":1695589200,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/25-09-2023\/?","theme":["Financial results"],"files":[[{"name":"2023_09_25_POLY_1H_2023_Half_yearly_report_eng","type":"pdf","size":"2.07 Mb","link":"\/upload\/ib\/1\/23-09-25\/2023_09_25_POLY_1H_2023_Half_yearly_report_eng.pdf"}]]},{"name":"London De-listing 
","date":1693281960,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/29-08-2023\/?","theme":["Shareholder information"],"files":[[{"name":"2023_08_29_London_De-listing_eng","type":"pdf","size":"121.65 Kb","link":"\/upload\/ib\/1\/23-08-29\/2023_08_29_London_De-listing_eng.pdf"}]]},{"name":"Resumption of trading on AIX","date":1691647200,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/10-08-2023\/?","theme":["Shareholder information"],"files":[[{"name":"2023_08_10_AIX_trading_resumption_eng","type":"pdf","size":"219.65 Kb","link":"\/upload\/ib\/1\/23-08-10\/2023_08_10_AIX_trading_resumption_eng.pdf"}]]},{"name":"Q2 2023 production results","date":1691528400,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/09-08-2023\/?","theme":["Production results"],"files":[[{"name":"2023_08_09_Q2_Production_eng","type":"pdf","size":"405.7 Kb","link":"\/upload\/ib\/1\/23-08-09\/2023_08_09_Q2_Production_eng.pdf"}]]},{"name":"Re-Domiciliation to AIFC Completed","date":1691442000,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/08-08-2023\/?","theme":["Shareholder information"],"files":[[{"name":"2023_08_08_Re-domiciliation_eng","type":"pdf","size":"199.45 Kb","link":"\/upload\/ib\/1\/23-08-08\/2023_08_08_Re-domiciliation_eng.pdf"}]]},{"name":"Suspension of Trading on the London Stock Exchange","date":1690862400,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/01-08-2023\/?","theme":["Other","Shareholder information"],"files":[[{"name":"2023_08_01_London_Suspension_eng","type":"pdf","size":"221.29 
Kb","link":"\/upload\/ib\/1\/23-08-02\/2023_08_01_London_Suspension_eng.pdf"}]]},{"name":"Results of GM","date":1690516800,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/28-07-2023\/?","theme":["Shareholder information"],"files":[[{"name":"2023_07_28_GM_results_eng","type":"pdf","size":"214.79 Kb","link":"\/upload\/ib\/1\/23-07-28\/2023_07_28_GM_results_eng.pdf"}]]},{"name":"Results of AGM","date":1690257600,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/25-07-2023\/?","theme":["Shareholder information"],"files":[[{"name":"2023_07_25_AGM_results_eng","type":"pdf","size":"244.42 Kb","link":"\/upload\/ib\/1\/23-07-25\/2023_07_25_AGM_results_eng.pdf"}]]},{"name":"Update to the timetable of the Re-domiciliation ","date":1689912000,"type":"\u041f\u0440\u0435\u0441\u0441-\u0440\u0435\u043b\u0438\u0437\u044b","link":"\/en\/investors-and-media\/news\/press-releases\/21-07-2023\/?","theme":["Shareholder information"],"files":[[{"name":"2023_07_21_Update_to_Re-domiciliaton_Timetable_eng","type":"pdf","size":"234.32 Kb","link":"\/upload\/ib\/1\/23-07-21\/2023_07_21_Update_to_Re-domiciliaton_Timetable_eng.pdf"}]]}],"nav":{"count":883,"total":45,"current":1},"filters":{"theme":[{"text":"Assets","id":540,"disabled":false,"selected":false},{"text":"Corporate governance","id":532,"disabled":false,"selected":false},{"text":"Dividends","id":531,"disabled":false,"selected":false},{"text":"ESG","id":539,"disabled":false,"selected":false},{"text":"Exploration","id":533,"disabled":false,"selected":false},{"text":"Financial results","id":530,"disabled":false,"selected":false},{"text":"Indexes and ratings","id":537,"disabled":false,"selected":false},{"text":"JV","id":534,"disabled":false,"selected":false},{"text":"Other","id":541,"disabled":false,"selected":false},{"text":"Production 
results","id":529,"disabled":false,"selected":false},{"text":"Regulatory disclosures","id":536,"disabled":false,"selected":false},{"text":"Reports","id":535,"disabled":false,"selected":false},{"text":"Shareholder information","id":538,"disabled":false,"selected":false}],"years":[{"text":"2024","id":2024,"disabled":false,"selected":false},{"text":"2023","id":2023,"disabled":false,"selected":false},{"text":"2022","id":2022,"disabled":false,"selected":false},{"text":"2021","id":2021,"disabled":false,"selected":false},{"text":"2020","id":2020,"disabled":false,"selected":false},{"text":"2019","id":2019,"disabled":false,"selected":false},{"text":"2018","id":2018,"disabled":false,"selected":false},{"text":"2017","id":2017,"disabled":false,"selected":false},{"text":"2016","id":2016,"disabled":false,"selected":false},{"text":"2015","id":2015,"disabled":false,"selected":false},{"text":"2014","id":2014,"disabled":false,"selected":false},{"text":"2013","id":2013,"disabled":false,"selected":false},{"text":"2012","id":2012,"disabled":false,"selected":false},{"text":"2011","id":2011,"disabled":false,"selected":false},{"text":"2010","id":2010,"disabled":false,"selected":false},{"text":"2009","id":2009,"disabled":false,"selected":false},{"text":"2008","id":2008,"disabled":false,"selected":false},{"text":"2007","id":2007,"disabled":false,"selected":false}]}},"footer":{"documents":[{"link":"\/upload\/ib\/88\/23-06-07\/Polymetal_General_Privacy_Notice_eng.pdf","name":"Privacy notice","fileInfo":"PDF (156.13 Kb)"},{"link":"\/upload\/ib\/62\/23-06-29\/2022_Polymetal_Modern_Slavery_Statement.pdf","name":"Modern Slavery Act Transparency Statement 2022","fileInfo":"PDF (435.06 Kb)"}],"links":[{"name":"Glossary ","link":"\/en\/glossary\/"},{"name":"Sitemap","link":"\/en\/sitemap\/"}],"tune":[{"name":"Contacts","link":"\/en\/contacts\/"},{"name":"Hotline","link":"\/en\/contacts\/hotline\/"}],"danger":"<div class=\"footer__info--text\">\r\n    <span>Please note that <a 
href=\"https:\/\/www.polymetalinternational.com\/\" class=\"link link--inline\">https:\/\/www.polymetalinternational.com\/<\/a> is the only official URL of&nbsp;Polymetal International plc.  <a href=\"https:\/\/www.polymetal.ru\/\" class=\"link link--inline\">https:\/\/www.polymetal.ru\/<\/a> is related to JSC Polymetal.<\/span>\r\n<\/div>\r\n<div class=\"footer__info--text\">\r\n    <span>Other websites even if&nbsp;they resemble the official ones and\/or contain full or&nbsp;a&nbsp;part of&nbsp;the Company&rsquo;s name in&nbsp;their URL do&nbsp;not relate to&nbsp;Polymetal International plc or&nbsp;its subsidiaries.<\/span>\r\n<\/div>\r\n<div class=\"footer__info--text\">\r\n    <span>Polymetal International plc does not have any official accounts in social media except of <a href=\"https:\/\/www.youtube.com\/channel\/UCddB8YqIjZnak6mlmTcpr3w\" class=\"link link--inline\">Youtube<\/a> and <a href=\"https:\/\/www.linkedin.com\/company\/polymetal\" class=\"link link--inline\">LinkedIn<\/a>. Any statements purportedly provided on behalf of a company is deliberate misrepresentation.<\/span>\r\n<\/div>"}}}
    '''
    import json
    json.loads(string)

My question is the following: what is the best way to parse this into something that Python will recognize as JSON?

  1. I had a look at js2py, but I couldn't find anything that did what I want.
  2. I also tried string.replace. After replacing the JavaScript booleans and null with their Python equivalents, I was able to put it through json.load, but I'm concerned about simply replacing every substring 'false' with 'False' and 'null' with 'None': the data might change in the future so that 'false' or 'null' appears in the middle of some other substring that is not a boolean or null, and replacing it would change the contents in unpredictable ways.
  3. I also had a look at this question, which at first glance looks like the same question, but the answer provided was specific to the JSON data the OP posted. I would prefer an answer that is independent of the actual content and works for all JSON.
  4. I tried removing App = and ; and putting the JSON into the variable called string, then passing it to json.loads. But I get an error:

    >>> json.loads(string)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/lib/python3.9/json/decoder.py", line 353, in raw_decode
        obj, end = self.scan_once(s, idx)
    json.decoder.JSONDecodeError: Expecting ',' delimiter: line 2 column 10378 (char 10378)
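
A likely cause of this traceback, assuming the page's JSON was pasted into an ordinary triple-quoted literal as in the snippet: Python processes backslash escapes in the literal before `json.loads` ever sees the text, so each `\"` inside a JSON string (the `danger` field contains many) turns into a bare `"` and ends the JSON string early. The sketch below uses a shortened, made-up payload; prefixing the literal with `r` keeps the backslashes intact.

```python
import json

# Shortened stand-in for the real payload (hypothetical, for illustration only).
# In the plain literal Python turns \" into ", which breaks the JSON string.
plain = '''{"danger": "<div class=\"footer\">text</div>"}'''
# In the raw literal the backslashes survive, so the JSON escapes stay valid.
raw = r'''{"danger": "<div class=\"footer\">text<\/div>"}'''


def try_parse(text):
    """Return the parsed value, or the decoder's message if parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return f"JSONDecodeError: {exc.msg}"


plain_result = try_parse(plain)
raw_result = try_parse(raw)
print(plain_result)
print(raw_result)
```

Text obtained at runtime (for example from an HTTP response) never passes through Python's literal parsing, so this only affects copies pasted into source code.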
  • Drop the App = at the start and the ; at the end, then parse it in Python as a JSON string with whatever package/lib/method you want. Commented Feb 14, 2024 at 20:34
  • There is no such thing as Python JSON and JavaScript JSON. There is only one JSON. That's the whole point of JSON: it can be used to interchange information between different systems, which might use different programming languages. What you need to do is serialize JSON in one system and deserialize it (that is, transform it into a programming-language-specific representation of the JSON data) in the other. All modern languages provide built-in support for (de-)serialization of JSON, or if they don't, there is a library. Commented Feb 14, 2024 at 20:34
  • @Marc I tried that. I will update my question. Commented Feb 14, 2024 at 20:38
  • Note that JSON is a subset of JavaScript syntax. For example, JavaScript allows you to use either single or double quotes; JSON only allows double quotes. The quotes are optional in JS for object keys if they're valid identifiers, but not optional in JSON. So you can't generally count on being able to parse a JS literal as JSON. Commented Feb 14, 2024 at 20:41
  • Your error message mentions column 10378 of line 2. None of the lines in your example are that long, so we can't tell what's wrong. Commented Feb 14, 2024 at 20:43
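
To illustrate the points in these comments with the standard-library `json` module: valid JSON needs no manual replacement of `false` or `null`, because `json.loads` maps those literals itself and leaves the same words untouched inside strings, while JavaScript-only syntax such as single quotes or unquoted keys is rejected. The sample strings are made up for illustration.

```python
import json

# Valid JSON: the literals false/null become False/None; the words inside
# the "name" string are left exactly as they are.
valid = '{"state": false, "type": null, "name": "nullable false alarm"}'
parsed = json.loads(valid)
print(parsed)

# JavaScript object literals that are not valid JSON.
js_only = ["{'state': false}", "{state: false}"]
rejected = []
for text in js_only:
    try:
        json.loads(text)
    except json.JSONDecodeError:
        rejected.append(text)
print(rejected)
```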

1 Answer

I don't know whether you are using Selenium or BeautifulSoup, but try:

    import json

    import requests
    from bs4 import BeautifulSoup

    url = 'https://polymetalinternational.com/en/investors-and-media/news/press-releases/'
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the script tag whose contents assign the 'App' object.
        # `string=` replaces the deprecated `text=` argument.
        def filter_scripts(text):
            return bool(text) and "App = " in text

        script_tag = soup.find('script', string=filter_scripts)
        if script_tag is None:
            raise RuntimeError("No script tag containing 'App = ' was found")

        # Take everything after 'App = ' up to the first '};', then restore the
        # closing brace. This assumes '};' does not occur inside the JSON data.
        app_content = script_tag.string.split('App = ', 1)[1].split('};', 1)[0] + '}'

        # Load the content as JSON
        app_data = json.loads(app_content)

        headlines = app_data['components']['press-release']['items']

        for headline in headlines:
            print(f"Name: {headline['name']}")
            print(f"Date: {headline['date']}")  # Unix timestamp
            print(f"Link: {headline['link']}")  # path relative to the site root
            print()
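
The split on `};` in this answer relies on that sequence never appearing inside the data. A sketch of a content-independent alternative, assuming the script text contains `App = ` followed by one valid JSON value: `json.JSONDecoder.raw_decode` parses a single value starting at a given index and reports where it ended, so whatever follows (the `;`, other statements) is ignored. The helper name `extract_json_after` and the sample script text are hypothetical.

```python
import json


def extract_json_after(script_text, marker="App = "):
    """Parse the JSON value that directly follows `marker` in `script_text`."""
    start = script_text.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so step over it first.
    while start < len(script_text) and script_text[start].isspace():
        start += 1
    value, _end = json.JSONDecoder().raw_decode(script_text, start)
    return value


# Made-up script text; note the "};" inside a string value, which would
# defeat a plain split on "};".
sample = 'var x = 1; App = {"a": "};", "ok": false, "items": [null]}; init(App);'
app = extract_json_after(sample)
print(app)
```

With the BeautifulSoup code from the answer, this could be called as `extract_json_after(script_tag.string)` in place of the split and `json.loads` steps.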


2 Comments

FYI, text= is deprecated, it should be string=. And you could just use string=re.compile('App =')
It worked after I changed the line to app_content = script_tag.renderContents().decode("utf-8").split('App = ')[1].strip()[:-1]
