
I'm trying to extract an array of images from JavaScript embedded in a web page and load it as JSON, using Python and BeautifulSoup. I have tried several approaches, but I keep getting errors.

The JavaScript on the web page:

<script type="text/javascript">
P.when('A').register("ImageBlockATF", function(A){
var data = {
'colorImages': { 'initial': [{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SL1003_.jpg",
"thumb":"https://images-na.ssl-images-amazon.com/images/I/41lv4ReBL4L._AC_US40_.jpg",
"large":"https://images-na.ssl-images-amazon.com/images/I/41lv4ReBL4L._AC_.jpg",
"main":{"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY355_.jpg":[355,355],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY450_.jpg":[450,450],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX425_.jpg":[425,425],
"variant":"MAIN","lowRes":null},{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SL1005_.jpg","thumb":"https://images-na.ssl-images-amazon.com/images/I/41shdN1aAoL._AC_US40_.jpg","large":"https://images-na.ssl-images-amazon.com/images/I/41shdN1aAoL._AC_.jpg","main":{"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SY355_.jpg":[355,355],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SY450_.jpg":[450,450],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX425_.jpg":[425,425],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX466_.jpg":[466,466],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX522_.jpg":[522,522],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX569_.jpg":[569,569],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX679_.jpg":[679,679]},"variant":"PT01","lowRes":null},{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SL1005_.jpg","thumb":"https://images-na.ssl-images-amazon.com/images/I/41pt8OOHsaL._AC_US40_.jpg","large":"https://images-na.ssl-images-amazon.com/images/I/41pt8OOHsaL._AC_.jpg","main":{"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SY355_.jpg":[355,355],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SY450_.jpg":[450,450],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX425_.jpg":[425,425],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX466_.jpg":[466,466],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX522_.jpg":[522,522],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX569_.jpg":[569,569],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX679_.jpg":[679,679]},"variant":"PT02","lowRes":null}]},
'colorToAsin': {'initial': {}},

'airyConfig' :A.$.parseJSON('{"jsUrl":"https://images-na.ssl-images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1460.0/js/airy.skin._CB485981857_.js","cssUrl":"https://images-na.ssl-images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1460.0/css/beacon._CB485971591_.css","swfUrl":"https://images-na.ssl-images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1460.0/flash/AiryBasicRenderer._CB485925577_.swf","foresterMetadataParams":{"marketplaceId":"A2VIGQ35RCS4UG","method":"Kitchen.ImageBlock","requestId":"4MGH16D6R7WCR018779W","session":"259-8488476-1037262","client":"Dpx"}}')

};
A.trigger('P.AboveTheFold'); // trigger ATF event.
return data;
});
</script>

I want to load the data under the key 'colorImages' into a JSON object. My aim is to collect all the images in one object so that I can process them further.

My code (what I tried):


url = "https://www.amazon.ae/DubayVintage-Astronaut-Figurine-Spaceman-Sculpture/dp/B08373YYCM/ref=sr_1_1?dchild=1&keywords=B08373YYCM&qid=1619498604&sr=8-1"
soup_main = getResponse(url, UserAgent())

pattern = re.compile(r"var data = { 'colorImages':(\{.*?\})")
script = soup_main.find("script", text=pattern)

data = pattern.search(script.text).group(1)
data = json.loads(data)
print(data)

The error I get:

Traceback (most recent call last):
  File "/home/dobuyme/Desktop/Sharaf DG/scrap.py", line 51, in <module>
    data = pattern.search(script.text).group(1)
AttributeError: 'NoneType' object has no attribute 'text'

I need to get the image links from the key "colorImages", like this:

[{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SL1003_.jpg",
"thumb":"https://images-na.ssl-images-amazon.com/images/I/41lv4ReBL4L._AC_US40_.jpg",
"large":"https://images-na.ssl-images-amazon.com/images/I/41lv4ReBL4L._AC_.jpg",
"main":{"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY355_.jpg":[355,355],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY450_.jpg":[450,450],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX425_.jpg":[425,425],
"variant":"MAIN","lowRes":null},{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SL1005_.jpg","thumb":"https://images-na.ssl-images-amazon.com/images/I/41shdN1aAoL._AC_US40_.jpg","large":"https://images-na.ssl-images-amazon.com/images/I/41shdN1aAoL._AC_.jpg","main":{"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SY355_.jpg":[355,355],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SY450_.jpg":[450,450],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX425_.jpg":[425,425],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX466_.jpg":[466,466],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX522_.jpg":[522,522],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX569_.jpg":[569,569],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX679_.jpg":[679,679]},"variant":"PT01","lowRes":null},{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SL1005_.jpg","thumb":"https://images-na.ssl-images-amazon.com/images/I/41pt8OOHsaL._AC_US40_.jpg","large":"https://images-na.ssl-images-amazon.com/images/I/41pt8OOHsaL._AC_.jpg","main":{"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SY355_.jpg":[355,355],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SY450_.jpg":[450,450],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX425_.jpg":[425,425],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX466_.jpg":[466,466],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX522_.jpg":[522,522],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX569_.jpg":[569,569],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX679_.jpg":[679,679]},"variant":"PT02","lowRes":null}]},

1 Answer


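First of all, your regex never matches. It expects var data = { 'colorImages': on a single line, but in the page there is a line break after the opening brace, so soup_main.find(...) returns None and script.text raises the AttributeError. Second, you might be getting an empty response (which would also make find return None), so be sure to add a user-agent to the request headers.

Finally, the string from the script requires some work before it can be safely passed to json.loads: it is a JavaScript object literal with single-quoted keys, not valid JSON.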
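
To illustrate the first point, here is a minimal sketch that runs without any network access. The sample string is a shortened, made-up stand-in for the page's script (an assumption, not the real markup), and the tolerant pattern is only one possible variant.

```python
import re

# Shortened stand-in for the script text (hypothetical sample, not the real page).
sample = (
    "var data = {\n"
    "'colorImages': { 'initial': [{\"variant\":\"MAIN\"}]},\n"
    "'colorToAsin': {'initial': {}},\n"
    "};"
)

# Pattern from the question: exactly one space between "{" and "'colorImages'".
strict = re.compile(r"var data = { 'colorImages':(\{.*?\})")
# Whitespace-tolerant variant: \s* also matches the line break.
tolerant = re.compile(r"var data = \{\s*'colorImages':\s*(\{.*?\})", re.S)

strict_match = strict.search(sample)
tolerant_match = tolerant.search(sample)

print(strict_match)             # no match, hence the None seen in the traceback
print(tolerant_match.group(1))  # the capture stops at the first "}"
```

Even the tolerant pattern captures only up to the first closing brace, so a lazy brace match alone does not yield a complete object. That is why the code below takes a different route.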

Here's my take on this:

import json
import re

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.85 Safari/537.36",
}

url = "https://www.amazon.ae/DubayVintage-Astronaut-Figurine-Spaceman-Sculpture/dp/B08373YYCM/ref=sr_1_1?dchild=1&keywords=B08373YYCM&qid=1619498604&sr=8-1"

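# Fetch the page and collect every inline JavaScript block.
html = requests.get(url, headers=headers).text
scripts = BeautifulSoup(html, "lxml").find_all("script", {"type": "text/javascript"})
# s.string is None for empty or external scripts, so guard before the "in" test.
filtered_scripts = [s.string for s in scripts if s.string and "colorImages" in s.string]
for script in filtered_scripts:
    # Capture everything from "data = " up to the comma before 'colorToAsin'.
    search = re.search(r"data = (.*),\s'color", script, re.S)
    if search:
        # Turn the JS literal into JSON: double quotes, no whitespace, closing brace.
        sanitise = (
                search.group(1)
                .replace("'", '"')
                .replace(" ", "")
                .replace("\n", "") + "}"
        )
        data = json.loads(sanitise)
        print(data["colorImages"]["initial"])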

Output:

[{'hiRes': 'https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SL1003_.jpg', 'thumb': 'https://images-na.ssl-images-amazon.com/images/I/41lv4ReBL4L._AC_US40_.jpg', 'large': 'https://images-na.ssl-images-amazon.com/images/I/41lv4ReBL4L._AC_.jpg', 'main': {'https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY355_.jpg': [355, 355], 'https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY450_.jpg': [450, 450], 'https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX425_.jpg': [425, 425], 'https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX466_.jpg': [466, 466], 'https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX522_.jpg': [522, 522], 'https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX569_.jpg': [569, 569], 'https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX679_.jpg': [679, 679]}, 'variant': 'MAIN', 'lowRes': None}, {'hiRes': 'https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SL1005_.jpg', 'thumb': 'https://images-na.ssl-images-amazon.com/images/I/41shdN1aAoL._AC_US40_.jpg', 'large': 'https://images-na.ssl-images-amazon.com/images/I/41shdN1aAoL._AC_.jpg', 'main': {'https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SY355_.jpg': [355, 355], 'https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SY450_.jpg': [450, 450], 'https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX425_.jpg': [425, 425], 'https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX466_.jpg': [466, 466], 'https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX522_.jpg': [522, 522], 'https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX569_.jpg': [569, 569], 'https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX679_.jpg': [679, 679]}, 'variant': 'PT01', 'lowRes': None}, {'hiRes': 
'https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SL1005_.jpg', 'thumb': 'https://images-na.ssl-images-amazon.com/images/I/41pt8OOHsaL._AC_US40_.jpg', 'large': 'https://images-na.ssl-images-amazon.com/images/I/41pt8OOHsaL._AC_.jpg', 'main': {'https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SY355_.jpg': [355, 355], 'https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SY450_.jpg': [450, 450], 'https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX425_.jpg': [425, 425], 'https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX466_.jpg': [466, 466], 'https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX522_.jpg': [522, 522], 'https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX569_.jpg': [569, 569], 'https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX679_.jpg': [679, 679]}, 'variant': 'PT02', 'lowRes': None}]
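
As an alternative to the quote and whitespace replacement, the array itself is already valid JSON in the markup shown in the question, so it may be possible to parse it directly. The sketch below uses json.JSONDecoder.raw_decode from the standard library, which parses one JSON value starting at a given index and ignores whatever follows. The function name extract_initial_images and the sample string are assumptions made for illustration, and the approach relies on the first "[" after 'colorImages' opening the image array.

```python
import json


def extract_initial_images(script_text):
    """Return the list stored under 'colorImages' -> 'initial', or None."""
    key_pos = script_text.find("'colorImages'")
    if key_pos == -1:
        return None
    # Assumption: the first "[" after the key opens the image array.
    start = script_text.find("[", key_pos)
    if start == -1:
        return None
    # raw_decode parses one JSON value beginning at `start`; trailing
    # JavaScript after the array is left untouched.
    images, _end = json.JSONDecoder().raw_decode(script_text, start)
    return images


# Hypothetical, shortened script text for demonstration.
sample = (
    "var data = {\n"
    "'colorImages': { 'initial': [{\"hiRes\":\"https://example.com/a.jpg\","
    "\"variant\":\"MAIN\",\"lowRes\":null}]},\n"
    "'colorToAsin': {'initial': {}},\n"
    "};"
)
images = extract_initial_images(sample)
print(images[0]["hiRes"])
```

With the code above, each s.string from filtered_scripts could be passed to this function instead of going through the regex and the replace chain. If the array is malformed, raw_decode raises json.JSONDecodeError, which the caller would need to handle.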

2 Comments

Thanks! It gives the expected result. But why does the script sometimes give me blank output?
You might be getting an empty response, as amazon might recognize you as a bot.
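
One way to cope with such intermittent empty responses is to check that the page actually contains the expected data and to retry after a pause if it does not. This is only a sketch: fetch_until_marker and its parameters are hypothetical names, the delays are arbitrary, and retrying does not guarantee that the site will stop treating the client as a bot.

```python
import time


def fetch_until_marker(fetch, marker="colorImages", attempts=3, delay=2.0):
    """Call fetch() until its result contains marker; return the text or None."""
    for attempt in range(attempts):
        html = fetch()
        if html and marker in html:
            return html
        if attempt < attempts - 1:
            # Wait a little longer before each new attempt.
            time.sleep(delay * (attempt + 1))
    return None
```

With the answer's code, fetch could be something like lambda: requests.get(url, headers=headers).text, and the returned HTML would then be handed to BeautifulSoup. A None result means every attempt came back without the marker.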
