4

I have the following HTML. How can I extract the JSON assigned to the variable window.__INITIAL_STATE__?

<!DOCTYPE doctype html>

<html lang="en">
<script>
                  window.sessConf = "-2912474957111138742";
                  /* <sl:translate_json> */
                  window.__INITIAL_STATE__ = { /* Target JSON here with 12 million characters */};
                  /* </sl:translate_json> */
                </script>
</html>
7
  • What have you tried so far? Commented Mar 4, 2019 at 21:21
  • Is the 12 million character json all in a single line? That would simplify the answer a lot. Commented Mar 4, 2019 at 21:25
  • @JeffUK I've tried getting all the text from the script tag and then calling split('\n'), but that somehow breaks the JSON into several substrings. Commented Mar 4, 2019 at 21:29
  • @solarc yes, it is a single line JSON. Commented Mar 4, 2019 at 21:30
  • Do you have Node.js installed on your system? Commented Mar 4, 2019 at 21:30

2 Answers

5

You can use the following Python code to extract the JavaScript code.

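from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
s = soup.find('script')
# Stub out `window`, run the page's script, then print the target object as JSON.
js = 'window = {};\n' + s.text.strip() + ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'
# Write as UTF-8 so non-ASCII characters in the data survive on any platform.
with open('temp.js', 'w', encoding='utf-8') as f:
    f.write(js)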

The JS code will be written to a file "temp.js". Then you can call node to execute the JS file.

from subprocess import check_output
window_init_state = check_output(['node','temp.js'])

The Python variable window_init_state now holds the JSON text (as bytes) of the JS object window.__INITIAL_STATE__, which you can parse in Python with json.loads.

Example

from subprocess import check_output
import json, bs4
html='''<!DOCTYPE doctype html>

<html lang="en">
<script> window.sessConf = "-2912474957111138742";
                  /* <sl:translate_json> */
                  window.__INITIAL_STATE__ = { 'Hello':'World'};
                  /* </sl:translate_json> */
                </script>
</html>'''
soup = bs4.BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's warning
with open('temp.js','w') as f:
    f.write('window = {};\n'+
            soup.find('script').text.strip()+
            ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));')
window_init_state = check_output(['node','temp.js'])
print(json.loads(window_init_state))

Output:

{'Hello': 'World'}
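If Node.js is not available, a pure-Python route may also work. This is a different technique from the one in this answer, and it rests on one assumption: the value assigned to window.__INITIAL_STATE__ must be strict JSON (double-quoted keys and strings), as the asker's comments suggest. It would not parse a JavaScript object literal such as the single-quoted { 'Hello':'World'} used in the example above. The helper name extract_initial_state is made up for illustration.

```python
import json
import re


def extract_initial_state(script_text, var_name="window.__INITIAL_STATE__"):
    """Hypothetical helper: parse the JSON value assigned to var_name."""
    # Find "<var_name> =" and consume the whitespace before the value.
    match = re.search(re.escape(var_name) + r"\s*=\s*", script_text)
    if match is None:
        raise ValueError(f"{var_name} not found in script text")
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (the trailing ';' and the comments).
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


script_text = '''
window.sessConf = "-2912474957111138742";
/* <sl:translate_json> */
window.__INITIAL_STATE__ = {"Hello": "World", "items": [1, 2, 3]};
/* </sl:translate_json> */
'''
state = extract_initial_state(script_text)
print(state["Hello"])
```

Because raw_decode stops at the end of the first complete value, there is no need to split the script on newlines or to strip the trailing semicolon by hand.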

7 Comments

I got FileNotFoundError: [WinError 2] The system cannot find the file specified from check_output. Could it be caused by with open('temp.js','w') as f?
That's probably because the node program is not on your PATH environment variable. Have you successfully installed it?
If you are not sure where Node.js has been installed, you can look it up at the locations listed here.
I have a further question: is there a way to achieve the same goal without writing the JS file to the hard drive?
Yes. Try using check_output(['node','-e', your_js_script]) where your_js_script is a python string variable that contains the JS script.
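A minimal sketch of the in-memory variant follows. One caveat is an assumption about the asker's setup, not something stated in the thread: operating systems cap the length of a command line, so a script holding 12 million characters may not fit into a -e argument. The sketch therefore pipes the script to node's standard input (node - reads the program from stdin), which is a swap for the -e form suggested above; for short scripts, check_output(['node', '-e', js]) should behave the same. The function name build_node_script is hypothetical.

```python
import json
import shutil
from subprocess import check_output


def build_node_script(script_text):
    # Same wrapper as in the answer: stub `window`, run the page script,
    # then print the target object as JSON.
    return ('window = {};\n' + script_text.strip() +
            ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));')


script_text = "window.__INITIAL_STATE__ = { 'Hello': 'World' };"
js = build_node_script(script_text)

node = shutil.which('node')  # None when node is not on PATH
if node is not None:
    # Feed the script through stdin, so nothing is written to disk and
    # the command line stays short regardless of the script's size.
    output = check_output([node, '-'], input=js.encode('utf-8'))
    print(json.loads(output))
else:
    print('node not found on PATH; skipping execution')
```

Checking shutil.which first also gives a clearer failure than the FileNotFoundError mentioned in the comments above.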
0

gdlmx's code is correct and very helpful.

from subprocess import check_output
soup = BeautifulSoup(html)
s=soup.find('script')
js = 'window = {};\n'+s.text.strip()+';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'
window_init_state = check_output(['node','temp.js'])

type(window_init_state) will be <class 'bytes'>, so you should decode it with the following code.

jsonData = window_init_state.decode("utf-8")
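As a small illustration of the bytes-versus-string point: in Python 3.6 and later, json.loads also accepts UTF-8 encoded bytes directly, so the explicit decode is only needed when you want the JSON text itself as a str. The raw value below is a stand-in for what check_output returns, not real output.

```python
import json

raw = b'{"Hello": "World"}'      # stand-in for the bytes from check_output
text = raw.decode("utf-8")       # explicit decode, as in the answer

parsed_from_text = json.loads(text)
parsed_from_bytes = json.loads(raw)  # bytes input is accepted as well
print(parsed_from_text == parsed_from_bytes)
```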

1 Comment

What are you trying to tell us? :) In your check_output line a "temp.js" is mentioned which isn't part of your answer
