2

I have the JSON object:

{
  "review_body": "Beef noodles realism weathered modem tanto hotdog dolphin long-chain hydrocarbons 8-bit euro-pop tank-traps Tokyo narrative.-space j-pop franchise otaku faded RAF girl artisanal hotdog denim ablative systemic smart-Kowloon. Man construct dome smart-computer pen monofilament beef noodles rain garage geodesic bicycle San Francisco wonton soup dissident nodal point tower. Boat uplink film dead man modem warehouse. Nodal point jeans euro-pop render-farm nano-fetishism semiotics hacker gang. Futurity narrative youtube otaku Kowloon free-market drugs. Fluidity assassin Tokyo bicycle media assault concrete industrial grade ablative lights boat BASE jump A.I. post-stimulate carbon. Physical computer narrative city youtube math-neural assassin modem.",
  "link": "http://www.getlost.com/store/acme/review/10607787#comment10607787",
  "seller_id": "104523",
  "survey_id": "9933447",
  "loggedin_user": 0,
  "store_rating": "8.02",
  "store_thumb": "http://www.getlost.com/store/thumbnail/acme.jpg",
  "store_name": "acme",
  "username": "ronin666",
  "rating": "1",
  "ref": "RR,acme,104523"
}

embedded within

<script LANGUAGE="javascript">
window.commentShare = $.extend((window.commentShare || {}), {
    10607787: {
        "review_body": "Beef noodles realism weathered modem tanto hotdog dolphin long-chain hydrocarbons 8-bit euro-pop tank-traps Tokyo narrative.-space j-pop franchise otaku faded RAF girl artisanal hotdog denim ablative systemic smart-Kowloon. Man construct dome smart-computer pen monofilament beef noodles rain garage geodesic bicycle San Francisco wonton soup dissident nodal point tower. Boat uplink film dead man modem warehouse. Nodal point jeans euro-pop render-farm nano-fetishism semiotics hacker gang. Futurity narrative youtube otaku Kowloon free-market drugs. Fluidity assassin Tokyo bicycle media assault concrete industrial grade ablative lights boat BASE jump A.I. post-stimulate carbon. Physical computer narrative city youtube math-neural assassin modem.",
        "link": "http:\/\/www.getlost.com\/store\/acme\/review\/10607787#comment10607787",
        "seller_id": "104523",
        "survey_id": "9933447",
        "loggedin_user": 0,
        "store_rating": "8.02",
        "store_thumb": "http:\/\/www.getlost.com\/store\/thumbnail\/acme.jpg",
        "store_name": "acme",
        "username": "ronin666",
        "rating": "1",
        "ref": "RR,acme,104523"
    }
});
</script>

I'd like to extract the aforementioned JSON object. How can this be achieved? Should I use regular expressions?

How I obtained this object (IPython, Python 2.7):

I was scraping the review site resellerratings.com for an arbitrary store using BeautifulSoup. After building the soup object, I noticed that the page contains useful JSON objects with information about each review of the chosen store. However, calling soup.find("script", language="javascript") still leaves me with the JSON object embedded inside the script tag.

from mechanize import Browser
from bs4 import BeautifulSoup

# Set up a browser that ignores robots.txt and meta-refresh redirects
br = Browser()
br.set_handle_robots(False)
br.set_handle_refresh(False)

example_url = 'http://www.resellerratings.com/store/My_Digital_Palace'

# Fetch the page and look up the first <script language="javascript"> tag
response = br.open(example_url)
soup = BeautifulSoup(response)
soup.find("script", language="javascript")

This should return:

<script language="javascript">
window.commentShare = $.extend(
    (window.commentShare || {}), {
        375015: {
            "review_body": "I bought a Kodak LS443 form My Digital Palace in 2004.  I also purchased a 5 year warranty.  Now the camera does not work and I am unable to contact them.  What do I do???  Am I just screwed???<br><br>Margaret Fuller<br>[email protected]",
            "link": "http:\/\/www.resellerratings.com\/store\/My_Digital_Palace\/review\/375015#comment375015",
            "seller_id": "6930",
            "survey_id": "385176",
            "loggedin_user": 0,
            "store_rating": "1.00",
            "store_thumb": "http:\/\/www.resellerratings.com\/store\/thumbnail\/My_Digital_Palace.jpg",
            "store_name": "My Digital Palace",
            "username": "maf1059",
            "rating": "1",
            "ref": "RR,My_Digital_Pala,6930"
        }
    }
);
</script>
  • are you treating this as just a regular text file that you want to parse or is it included in your web app? Commented Sep 22, 2015 at 20:31
  • @ergonaut This is actually something I scraped using beautiful soup in python. So it's something that I want to parse. Commented Sep 22, 2015 at 20:32
  • Why are you not simply accessing the object by the variable? Commented Sep 22, 2015 at 20:33
  • a little more info would be helpful...you want to parse in python ? or javascript ? is $.extend a call to jquery or what ? Commented Sep 22, 2015 at 20:35
  • As it is currently it simply sounds like you're trying to do something terribly wrong. Commented Sep 22, 2015 at 20:41

2 Answers

4

Easy enough: just strip out the wrapper and extraneous lines to get at the juicy, juicy JSON itself. The code below removes the first four lines and the last three of your JavaScript snippet (as laid out in the BeautifulSoup output you posted), while also putting back the initial { that gets lost:

import json

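# Select the same tag as in the question, drop the wrapper lines, restore the opening brace
script = soup.find("script", language="javascript")
raw = "{" + "\n".join(str(script).split("\n")[4:-3])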

If the <script> objects on the page aren't written in a uniform way (that is, it's not always exactly the first four lines and last three that are the extraneous ones), you may have to resort to regex or other matching. After that, you can go ahead and access the JSON.

json_obj = json.loads(raw)

Your problem was simply a regex/splitting issue. I think folks were a little thrown off by the Javascript. :)
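If the line counts turn out not to be uniform, one possible alternative is sketched below. It is only a sketch under assumptions: it assumes every entry looks like `<digits>: { ... }` as in the snippets above, and the names extract_reviews, KEY_RE and sample are hypothetical, not something from the site or from BeautifulSoup. It uses a regex to find each numeric key and then lets the standard library's json.JSONDecoder.raw_decode parse the object that follows, so no line counting is needed.

```python
import json
import re

# Hypothetical pattern: a numeric review id, a colon, then the opening brace
# of the JSON object (the brace is only looked ahead at, not consumed).
KEY_RE = re.compile(r'(\d+)\s*:\s*(?=\{)')


def extract_reviews(script_text):
    """Return {review_id: review_dict} for every `<id>: {...}` entry found."""
    decoder = json.JSONDecoder()
    reviews = {}
    pos = 0
    while True:
        match = KEY_RE.search(script_text, pos)
        if match is None:
            break
        # raw_decode parses one JSON value from the start of the string and
        # reports how many characters it consumed; trailing text is ignored.
        obj, consumed = decoder.raw_decode(script_text[match.end():])
        reviews[match.group(1)] = obj
        # Resume the search after the object just parsed.
        pos = match.end() + consumed
    return reviews


# Shortened stand-in for the script text shown in the question.
sample = r'''<script language="javascript">
window.commentShare = $.extend((window.commentShare || {}), {
    10607787: {
        "link": "http:\/\/www.getlost.com\/store\/acme\/review\/10607787#comment10607787",
        "seller_id": "104523",
        "loggedin_user": 0,
        "rating": "1"
    }
});
</script>'''

reviews = extract_reviews(sample)
```

With the soup from the question, passing str(soup.find("script", language="javascript")) as script_text should work the same way, since everything outside the `<id>: {...}` entries is skipped. One caveat: a review body that itself contains text such as `3: {` after an entry has been parsed could confuse the pattern, so treat this as a starting point rather than a robust parser.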

2

If you have this JSON on your page and wish to access it via JavaScript, you can do so by looping through the object(s) within the window.commentShare object.

Here is a little test function for you to add to your page so you can see how that would work. It will alert one of your JSON values. For completeness, I've added it to the end of your example.

<script  LANGUAGE="javascript">
window.commentShare = $.extend((window.commentShare || {}), {
    10607787: {
        "review_body": "Beef noodles realism weathered modem tanto hotdog dolphin long-chain hydrocarbons 8-bit euro-pop tank-traps Tokyo narrative.-space j-pop franchise otaku faded RAF girl artisanal hotdog denim ablative systemic smart-Kowloon. Man construct dome smart-computer pen monofilament beef noodles rain garage geodesic bicycle San Francisco wonton soup dissident nodal point tower. Boat uplink film dead man modem warehouse. Nodal point jeans euro-pop render-farm nano-fetishism semiotics hacker gang. Futurity narrative youtube otaku Kowloon free-market drugs. Fluidity assassin Tokyo bicycle media assault concrete industrial grade ablative lights boat BASE jump A.I. post-stimulate carbon. Physical computer narrative city youtube math-neural assassin modem.",
        "link": "http:\/\/www.getlost.com\/store\/acme\/review\/10607787#comment10607787",
        "seller_id": "104523",
        "survey_id": "9933447",
        "loggedin_user": 0,
        "store_rating": "8.02",
        "store_thumb": "http:\/\/www.getlost.com\/store\/thumbnail\/acme.jpg",
        "store_name": "acme",
        "username": "ronin666",
        "rating": "1",
        "ref": "RR,acme,104523"
    }
});

// Alert the review_body of every entry stored in window.commentShare
function test() {
    for (var i in window.commentShare) {
        var myObj = window.commentShare[i];
        alert(myObj.review_body);
    }
}
test();

</script>
