I'm scraping an Instagram page (https://instagram.com/celmirashop) and getting back HTML with some JavaScript in it. The result looks like this:

<script>some script</script>
<script>some script</script>
<script>some script</script>
<script>window._sharedData = {"config":{"csrf_token":"sSqrj6c8tfN1HwOIlwmpqONT2bAPhtNu","viewer":null etc....</script>

I have written a script like this:

import urllib.request
import json
import re
from bs4 import BeautifulSoup

web = urllib.request.urlopen("https://instagram.com/celmirashop")
soup = BeautifulSoup(web.read(), 'lxml')
pattern = re.compile(r"window._sharedData = .")
script = soup.find("script",text=pattern)
print(script)

It gives me the specific script tag that I want, like this:

<script>window._sharedData = {"config":{"csrf_token":"sSqrj6c8tfN1HwOIlwmpqONT2bAPhtNu","viewer":null etc....</script>

How can I get the value of window._sharedData and loop over it? I want to save it in MySQL.

2 Answers

Assuming the assignment ends with ; and occurs only once, you can apply the following regex pattern to the response text:

import re

s = '''<script>window._sharedData = {"config":{"csrf_token":"sSqrj6c8tfN1HwOIlwmpqONT2bAPhtNu","viewer":null}};</script>'''
p = re.compile(r'window\._sharedData = (.*);')
print(p.findall(s)[0])
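
To loop over the value, as the question asks, the captured text can be handed to json.loads, which turns it into ordinary Python dicts and lists. The following is only a minimal sketch: the sample string is a shortened, made-up stand-in for the real page source, extract_shared_data is a hypothetical helper name, and it assumes the assignment is valid JSON terminated by };</script>.

```python
import json
import re

# Hypothetical, shortened stand-in for the real page source.
html = '<script>window._sharedData = {"config":{"csrf_token":"abc123","viewer":null}};</script>'


def extract_shared_data(page_source):
    """Return window._sharedData as a dict, or None if it is not found."""
    # Assumes the object literal is valid JSON and ends right before ";</script>".
    match = re.search(
        r'window\._sharedData\s*=\s*(\{.*?\});</script>',
        page_source,
        re.DOTALL,
    )
    if match is None:
        return None
    return json.loads(match.group(1))


shared_data = extract_shared_data(html)

# Top-level keys and values can now be iterated like any dict.
for key, value in shared_data.items():
    print(key, value)
```

Returning None when the pattern is missing lets the caller tell "the page layout changed" apart from a JSON decoding error, which json.loads reports by raising an exception.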

Here is a way:

>>> xxx = '''
... <script>window._sharedData = {"config":{"csrf_token":"sSqrj6c8tfN1HwOIlwmpqONT2bAPhtNu","viewer":null etc....</script>
... '''
>>> xxx.split('"csrf_token":"')
['\n<script>window._sharedData = {"config":{', 'sSqrj6c8tfN1HwOIlwmpqONT2bAPhtNu","viewer":null etc....</script>\n']

>>> xxx.split('"csrf_token":"')[1].split('"')[0]
'sSqrj6c8tfN1HwOIlwmpqONT2bAPhtNu'

Note that BeautifulSoup doesn't run any JavaScript, so neither the script tags nor any other JavaScript on the page is executed.

You'll need something like Selenium to do anything more with it.

If you do go with Selenium, you can do something like:

import json
import selenium.webdriver

options = selenium.webdriver.FirefoxOptions()
options.add_argument("--headless")

driver = selenium.webdriver.Firefox(options=options)

driver.get('https://instagram.com/celmirashop')

# note: this assumes the object passed to JSON.stringify
# contains no circular references

# run this javascript in the firefox browser
js = "return JSON.stringify(window._sharedData)"

# load the hopefully stringified json to python 
hello = json.loads(driver.execute_script(js))

for k, v in hello.items():
    print(k, v)
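
The loop above only visits the top-level keys, while the data itself is nested. If the goal is to store individual values in a database, one option is to walk the structure recursively and produce flat (path, value) pairs. This is a sketch under assumptions: flatten is a hypothetical helper, the sample dict is made up, and the path format is just one possible convention.

```python
def flatten(value, prefix=""):
    """Yield (path, leaf_value) pairs for every leaf in nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            # Join nested keys with dots, e.g. "config.csrf_token".
            yield from flatten(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            # Mark list positions with brackets, e.g. "tags[0]".
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


# Made-up sample standing in for the parsed window._sharedData.
sample = {"config": {"csrf_token": "abc123", "viewer": None}, "tags": ["a", "b"]}

rows = list(flatten(sample))
for path, leaf in rows:
    print(path, leaf)
```

Each pair in rows could then be passed as parameters to an INSERT statement through whichever MySQL driver is in use.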

1 Comment

return JSON.stringify(window._sharedData) nice +
