How to parse JavaScript code returned from a GET request in Python

Question

I'm sending the following GET request

<a href="#" onclick="new Ajax.Request('/book/reviews/4981?authenticity_token=vxZvklgqILI3SBwtJLDN5DicJKt93LiOWxYwFa%2BrWDdsJxTTAs46WvPN3L1PKNW3qpmacr%2BnWYXexhR%2BfoB3Cw%3D%3D&amp;hide_last_page=true&amp;language_code=en&amp;page=4', {asynchronous:true, evalScripts:true, method:'get', parameters:'authenticity_token=' + encodeURIComponent('FUvf1v6N9TgtBKVmo5I3YLm3yVwb//WU9zZDdj1oWd3GeqSXpGnv0OmBZfbICi8zK7J3hdmEFJ9y5mcd7EN24Q==')}); return false;">4</a>

in Python, which I have written as

import urllib.parse
import requests

URL = 'https://www.goodreads.com/book/reviews/4981?authenticity_token=vxZvklgqILI3SBwtJLDN5DicJKt93LiOWxYwFa%2BrWDdsJxTTAs46WvPN3L1PKNW3qpmacr%2BnWYXexhR%2BfoB3Cw%3D%3D&hide_last_page=true&language_code=en&page=4'

s = 'FUvf1v6N9TgtBKVmo5I3YLm3yVwb//WU9zZDdj1oWd3GeqSXpGnv0OmBZfbICi8zK7J3hdmEFJ9y5mcd7EN24Q=='

PARAMS = {'asynchronous':True, 
 'evalScripts':True, 
 'method':'get', 
 'parameters':'authenticity_token=' + urllib.parse.quote(s.encode("utf-8"))
}

r = requests.get(url = URL, params = PARAMS)

I'm new to this, but the response seems to be encoded in something that doesn't look like plain ASCII text. It also contains HTML, which is what I really want. Here's a piece of what is returned:

b'Element.update("reviews", "\\n\\u003cdiv class=\\"bookReviewsPaginationCount\\"\\u003e\\n    
\\u003cspan class=\\"smallText\\"\\u003e\\nShowing 91-120\\n\\u003c/span\\u003e\\n\\n\\u003c/div\\u003e\\n\\n\\n\\u003cdiv id=\\"reviewControls\\"\\n     class=\\"reviewControls u-defaultType clearFix\\"\\u003e\\n   \\u003cdiv class=\\"reviewControls--left\\"\\u003e\\n    
\\u003cspan class=\\"stars staticStars notranslate\\"\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p3\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\n    
\\u003cspan class=\\"u-visuallyHidden\\"\\u003eAverage rating\\u003c/span\\u003e\\n    4.07\\n    \\u003cspan class=\\"greyText\\"\\u003e\\u0026nbsp;\\u0026middot;\\u0026nbsp;\\u003c/span\\u003e\\n  \\u003c/div\\u003e\\n  \\u003cdiv class=\\"reviewControls__ratingDetails reviewControls--left rating_graph\\"\\u003e\\n    \\u003cspan id=\\"reviewControls__ratingDetailsMiniGraph\\"\\u003e\\n    \\u003cscript type=\\"text/javascript\\"\\u003e\\n    
//\\u003c![CDATA[\\n      $j(document).ready(function() {\\n        var vis = renderRatingGraph(\\n            [436969, 351497, 175037, 52003, 27985],\\n            \\"reviewControls__ratingDetailsMiniGraph\\");\\n        $j(\\"#reviewControls__ratingDetailsMiniGraph\\").prependTo(\\"#rating_details_tip\\");\\n      });\\n

Is there a way to parse the code? I've tried:

BeautifulSoup scrape from javascript (encoded) variable

but it does not work with the code that is returned to me.

Thanks

Post your code so that we can see where the issue is!
– αԋɱҽԃ αмєяιcαη, Commented Nov 17, 2019 at 0:30
This is a bytes object, so try calling .decode() on it. It will also look more normal if you print it rather than view the escaped value in the REPL.
– kaya3, Commented Nov 17, 2019 at 0:33
OK, I've added more code. And @kaya3, the .decode() method only removes the 'b' in front of the text. I also want to remove the escape sequences such as \u003c, \u0026, etc.
– sadlyfe, Commented Nov 17, 2019 at 0:44
These escape characters are only printed as escape characters because you're looking at the repr of it; in the string they are the actual characters being escaped. Try calling print on the result.
– kaya3, Commented Nov 17, 2019 at 1:05
When I run print(r.content.decode()) the string looks like: Element.update("reviews", "\n\u003cdiv class=\"bookReviewsPaginationCount\"\u003e\n \u003cspan class=\"smallText\"\u003e\nShowing 91-120\n\u003c/span\u003e\n\n\u003c/div\u003e\n\n\n\u003cdiv id=\"reviewControls\"\n class=\"reviewControls u-defaultType clearFix\"\u003e\n \u003cdiv class=\"reviewControls--left\"\u003e\n \u003cspan class=\"stars staticStars notranslate\"\u003e\u003cspan size=\"12x12\" Is there a way to turn it into more readable code?
– sadlyfe, Commented Nov 17, 2019 at 1:15
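
The exchange above can be reproduced offline with a small sketch. It assumes a shortened, hypothetical stand-in (`raw`) for `r.content`, and shows that `.decode()` and `print` only remove the repr-level doubling of backslashes, while sequences such as `\u003c` remain in the text because they are JavaScript string escapes that are part of the payload itself.

```python
# Hypothetical, shortened stand-in for r.content from the question.
raw = b'Element.update("reviews", "\\n\\u003cdiv class=\\"smallText\\"\\u003e");'

# Decoding bytes to str changes the type, not the characters in the payload.
text = raw.decode("utf-8")

print(repr(raw))  # the repr doubles every backslash and adds the b prefix
print(text)       # single backslashes now, but \u003c is still six literal characters

# The backslash sequences are JavaScript escapes stored in the payload,
# so the decoded text contains the escape itself rather than "<".
has_literal_escape = "\\u003c" in text
has_angle_bracket = "<" in text
print(has_literal_escape, has_angle_bracket)
```

Turning those escapes into real characters therefore needs a separate unescaping step on the string literal.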

VietHTran · Accepted Answer · 2019-11-17 01:43:16Z

1

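The returned string looks like JavaScript code that generates an HTML element from a string literal (Element.update appears to come from the Prototype library rather than jQuery). You probably need to grab that string literal with the slice r.text[27:-2], where 27 is the length of the prefix Element.update("reviews", ", and then use encode().decode('unicode_escape') to get a string that can be parsed by BeautifulSoup.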

import urllib.parse
import requests
from bs4 import BeautifulSoup as Soup

URL = 'https://www.goodreads.com/book/reviews/4981?authenticity_token=vxZvklgqILI3SBwtJLDN5DicJKt93LiOWxYwFa%2BrWDdsJxTTAs46WvPN3L1PKNW3qpmacr%2BnWYXexhR%2BfoB3Cw%3D%3D&hide_last_page=true&language_code=en&page=4'

s = 'FUvf1v6N9TgtBKVmo5I3YLm3yVwb//WU9zZDdj1oWd3GeqSXpGnv0OmBZfbICi8zK7J3hdmEFJ9y5mcd7EN24Q=='

PARAMS = {'asynchronous':True, 
 'evalScripts':True, 
 'method':'get', 
 'parameters':'authenticity_token=' + urllib.parse.quote(s.encode("utf-8"))
}

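r = requests.get(url=URL, params=PARAMS)

# Drop the leading 'Element.update("reviews", "' (27 characters) and the
# trailing two characters, then turn the \n and \uXXXX escapes into real characters.
html_str = r.text[27:-2].encode().decode('unicode_escape')
soup = Soup(html_str, "html.parser")
print(soup)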
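
The fixed slice depends on the exact length of the `Element.update("reviews", "` prefix, and `.encode().decode('unicode_escape')` interprets the bytes as Latin-1, so non-ASCII characters in the reviews can come out garbled. A possible alternative, sketched below, is to capture the string literal with a regular expression and decode it with `json.loads`. The helper name `extract_update_html` and the `sample` string are hypothetical, and the sketch assumes the literal only uses escapes that JSON also accepts (`\n`, `\"`, `\\`, `\uXXXX`), which matches the excerpt in the question but is not guaranteed for every response.

```python
import json
import re


def extract_update_html(js_source):
    """Return the decoded second argument of an Element.update("id", "...") call."""
    # Capture the double-quoted payload, allowing backslash-escaped
    # characters (including \") inside it.
    match = re.search(
        r'Element\.update\(\s*"[^"]*"\s*,\s*("(?:[^"\\]|\\.)*")',
        js_source,
        re.DOTALL,
    )
    if match is None:
        raise ValueError("no Element.update(...) call found")
    # Assumption: the literal is also a valid JSON string.
    # strict=False tolerates raw control characters such as literal newlines.
    return json.loads(match.group(1), strict=False)


# Hypothetical, shortened stand-in for r.text from the answer above.
sample = (
    'Element.update("reviews", "\\n\\u003cdiv class=\\"smallText\\"\\u003e'
    '\\nShowing 91-120\\n\\u003c/div\\u003e");'
)
html = extract_update_html(sample)
print(html)
```

The returned string can then be passed to `Soup(html, "html.parser")` in the same way as `html_str` in the answer's code.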

sadlyfe Over a year ago

Thanks, this is exactly what I needed
