I'm trying to send the following GET request
<a href="#" onclick="new Ajax.Request('/book/reviews/4981?authenticity_token=vxZvklgqILI3SBwtJLDN5DicJKt93LiOWxYwFa%2BrWDdsJxTTAs46WvPN3L1PKNW3qpmacr%2BnWYXexhR%2BfoB3Cw%3D%3D&amp;hide_last_page=true&amp;language_code=en&amp;page=4', {asynchronous:true, evalScripts:true, method:'get', parameters:'authenticity_token=' + encodeURIComponent('FUvf1v6N9TgtBKVmo5I3YLm3yVwb//WU9zZDdj1oWd3GeqSXpGnv0OmBZfbICi8zK7J3hdmEFJ9y5mcd7EN24Q==')}); return false;">4</a>
in Python, which I have written as:
import requests

# '&amp;' in the HTML attribute stands for a plain '&' in the real URL
URL = 'https://www.goodreads.com/book/reviews/4981?authenticity_token=vxZvklgqILI3SBwtJLDN5DicJKt93LiOWxYwFa%2BrWDdsJxTTAs46WvPN3L1PKNW3qpmacr%2BnWYXexhR%2BfoB3Cw%3D%3D&hide_last_page=true&language_code=en&page=4'
s = 'FUvf1v6N9TgtBKVmo5I3YLm3yVwb//WU9zZDdj1oWd3GeqSXpGnv0OmBZfbICi8zK7J3hdmEFJ9y5mcd7EN24Q=='
# asynchronous, evalScripts and method are client-side Ajax.Request options, not query parameters;
# only the 'parameters' string is sent, and requests URL-encodes the value itself
PARAMS = {'authenticity_token': s}
r = requests.get(url=URL, params=PARAMS)
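One thing worth checking in a request like this: the link in the page source is HTML-escaped, so each &amp; stands for a plain & in the actual URL. Below is a minimal sketch of the difference, using only the standard library. The token is shortened to the placeholder TOKEN and query_keys is a hypothetical helper; neither comes from the original page.

```python
import html
from urllib.parse import urlsplit, parse_qs

# Path and query as they appear inside the HTML attribute
# (the real token is replaced by the placeholder TOKEN).
raw = ("/book/reviews/4981?authenticity_token=TOKEN"
       "&amp;hide_last_page=true&amp;language_code=en&amp;page=4")


def query_keys(url):
    """Return the sorted parameter names found in the URL's query string."""
    return sorted(parse_qs(urlsplit(url).query))


# Parsed as-is, the entity text leaks into the parameter names.
print(query_keys(raw))

# After unescaping the HTML entities, the names are the intended ones.
clean = html.unescape(raw)
print(query_keys(clean))
```

If the second list is what the server expects, the unescaped URL is the one to pass to requests.get.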
I'm new to this, but the response seems to be encoded in something that doesn't look like plain ASCII text. It also contains HTML, which is what I really want. Here's part of what is returned:
b'Element.update("reviews", "\\n\\u003cdiv class=\\"bookReviewsPaginationCount\\"\\u003e\\n
\\u003cspan class=\\"smallText\\"\\u003e\\nShowing 91-120\\n\\u003c/span\\u003e\\n\\n\\u003c/div\\u003e\\n\\n\\n\\u003cdiv id=\\"reviewControls\\"\\n class=\\"reviewControls u-defaultType clearFix\\"\\u003e\\n \\u003cdiv class=\\"reviewControls--left\\"\\u003e\\n
\\u003cspan class=\\"stars staticStars notranslate\\"\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p3\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\n
\\u003cspan class=\\"u-visuallyHidden\\"\\u003eAverage rating\\u003c/span\\u003e\\n 4.07\\n \\u003cspan class=\\"greyText\\"\\u003e\\u0026nbsp;\\u0026middot;\\u0026nbsp;\\u003c/span\\u003e\\n \\u003c/div\\u003e\\n \\u003cdiv class=\\"reviewControls__ratingDetails reviewControls--left rating_graph\\"\\u003e\\n \\u003cspan id=\\"reviewControls__ratingDetailsMiniGraph\\"\\u003e\\n \\u003cscript type=\\"text/javascript\\"\\u003e\\n
//\\u003c![CDATA[\\n $j(document).ready(function() {\\n var vis = renderRatingGraph(\\n [436969, 351497, 175037, 52003, 27985],\\n \\"reviewControls__ratingDetailsMiniGraph\\");\\n $j(\\"#reviewControls__ratingDetailsMiniGraph\\").prependTo(\\"#rating_details_tip\\");\\n });\\n
Is there a way to parse this response? I've tried the approach from:
BeautifulSoup scrape from javascript (encoded) variable
but it does not work with the response I get back.
Thanks
That's a bytes object, so try calling .decode() on it. It will also look more normal if you print it rather than view the escaped value in the REPL.

The .decode() method only removes the 'b' in front of the text. I also want to remove the escape characters such as \u003, \u0026, etc.

That's just the repr of it - in the string they are the actual characters being escaped. Try calling print on the result.

With print(r.content.decode()) the string looks like:
Element.update("reviews", "\n\u003cdiv class=\"bookReviewsPaginationCount\"\u003e\n \u003cspan class=\"smallText\"\u003e\nShowing 91-120\n\u003c/span\u003e\n\n\u003c/div\u003e\n\n\n\u003cdiv id=\"reviewControls\"\n class=\"reviewControls u-defaultType clearFix\"\u003e\n \u003cdiv class=\"reviewControls--left\"\u003e\n \u003cspan class=\"stars staticStars notranslate\"\u003e\u003cspan size=\"12x12\"
Is there a way to turn it into more readable code?
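
The \u003c and \" sequences that survive print are escapes inside a JavaScript string literal, so decoding the bytes alone does not remove them. One possible approach, sketched under the assumption that the response is a single Element.update("id", "...") call and that the literal only uses JSON-compatible escapes, is to cut out the second argument and let json.loads unescape it. The helper name extract_html and the closing "); added to the shortened sample are assumptions for illustration.

```python
import json
import re


def extract_html(js_text):
    """Pull the HTML out of a response shaped like Element.update("id", "...")."""
    # Assumption: the second argument is one double-quoted string literal.
    pattern = r'Element\.update\("[^"]*",\s*("(?:[^"\\]|\\.)*")'
    match = re.search(pattern, js_text, re.DOTALL)
    if match is None:
        raise ValueError("no Element.update(...) call found")
    # \n, \" and \u003c are also valid JSON string escapes, so json.loads
    # turns the quoted literal into the real characters.
    return json.loads(match.group(1))


# Shortened excerpt of the response from the question; the closing '");'
# is added here so the sample is a complete call.
sample = (r'Element.update("reviews", "\n\u003cdiv class=\"bookReviewsPaginationCount\"'
          r'\u003e\n\u003cspan class=\"smallText\"\u003e\nShowing 91-120\n'
          r'\u003c/span\u003e\n\u003c/div\u003e");')

html_text = extract_html(sample)
print(html_text)
```

With the real response, r.text (or r.content.decode()) would take the place of sample, and the resulting html_text can then be handed to an HTML parser, for example BeautifulSoup(html_text, "html.parser").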