I'm sending the following GET request

<a href="#" onclick="new Ajax.Request('/book/reviews/4981?authenticity_token=vxZvklgqILI3SBwtJLDN5DicJKt93LiOWxYwFa%2BrWDdsJxTTAs46WvPN3L1PKNW3qpmacr%2BnWYXexhR%2BfoB3Cw%3D%3D&amp;amp;hide_last_page=true&amp;amp;language_code=en&amp;amp;page=4', {asynchronous:true, evalScripts:true, method:'get', parameters:'authenticity_token=' + encodeURIComponent('FUvf1v6N9TgtBKVmo5I3YLm3yVwb//WU9zZDdj1oWd3GeqSXpGnv0OmBZfbICi8zK7J3hdmEFJ9y5mcd7EN24Q==')}); return false;">4</a>

in Python, which I have written as

import urllib.parse
import requests

URL = 'https://www.goodreads.com/book/reviews/4981?authenticity_token=vxZvklgqILI3SBwtJLDN5DicJKt93LiOWxYwFa%2BrWDdsJxTTAs46WvPN3L1PKNW3qpmacr%2BnWYXexhR%2BfoB3Cw%3D%3D&amp;amp;hide_last_page=true&amp;amp;language_code=en&amp;amp;page=4'

s = 'FUvf1v6N9TgtBKVmo5I3YLm3yVwb//WU9zZDdj1oWd3GeqSXpGnv0OmBZfbICi8zK7J3hdmEFJ9y5mcd7EN24Q=='

PARAMS = {'asynchronous':True, 
 'evalScripts':True, 
 'method':'get', 
 'parameters':'authenticity_token=' + urllib.parse.quote(s.encode("utf-8"))
}

r = requests.get(url = URL, params = PARAMS) 

I'm new to this, but the response seems to be encoded in a way that doesn't look like plain ASCII text. It also contains HTML, which is what I actually want. Here's part of what is returned:

b'Element.update("reviews", "\\n\\u003cdiv class=\\"bookReviewsPaginationCount\\"\\u003e\\n    
\\u003cspan class=\\"smallText\\"\\u003e\\nShowing 91-120\\n\\u003c/span\\u003e\\n\\n\\u003c/div\\u003e\\n\\n\\n\\u003cdiv id=\\"reviewControls\\"\\n     class=\\"reviewControls u-defaultType clearFix\\"\\u003e\\n   \\u003cdiv class=\\"reviewControls--left\\"\\u003e\\n    
\\u003cspan class=\\"stars staticStars notranslate\\"\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p3\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\n    
\\u003cspan class=\\"u-visuallyHidden\\"\\u003eAverage rating\\u003c/span\\u003e\\n    4.07\\n    \\u003cspan class=\\"greyText\\"\\u003e\\u0026nbsp;\\u0026middot;\\u0026nbsp;\\u003c/span\\u003e\\n  \\u003c/div\\u003e\\n  \\u003cdiv class=\\"reviewControls__ratingDetails reviewControls--left rating_graph\\"\\u003e\\n    \\u003cspan id=\\"reviewControls__ratingDetailsMiniGraph\\"\\u003e\\n    \\u003cscript type=\\"text/javascript\\"\\u003e\\n    
//\\u003c![CDATA[\\n      $j(document).ready(function() {\\n        var vis = renderRatingGraph(\\n            [436969, 351497, 175037, 52003, 27985],\\n            \\"reviewControls__ratingDetailsMiniGraph\\");\\n        $j(\\"#reviewControls__ratingDetailsMiniGraph\\").prependTo(\\"#rating_details_tip\\");\\n      });\\n  

Is there a way to parse this response? I've tried the approach from "BeautifulSoup scrape from javascript (encoded) variable", but it does not work with the response I get back.

Thanks

  • Please post your code so that we can see where the issue is. Commented Nov 17, 2019 at 0:30
  • This is a bytes object, so try calling .decode() on it. It will also look more normal if you print it rather than view the escaped value in the REPL. Commented Nov 17, 2019 at 0:33
  • OK, I've added more code. And @kaya3, the .decode() method only removes the 'b' in front of the text. I also want to remove the escape sequences such as \u003c, \u0026, etc. Commented Nov 17, 2019 at 0:44
  • These escape characters are only printed as escape characters because you're looking at the repr of it. In the string they are the actual characters being escaped. Try calling print on the result. Commented Nov 17, 2019 at 1:05
  • When I run print(r.content.decode()), the string looks like this: Element.update("reviews", "\n\u003cdiv class=\"bookReviewsPaginationCount\"\u003e\n \u003cspan class=\"smallText\"\u003e\nShowing 91-120\n\u003c/span\u003e\n\n\u003c/div\u003e\n\n\n\u003cdiv id=\"reviewControls\"\n class=\"reviewControls u-defaultType clearFix\"\u003e\n \u003cdiv class=\"reviewControls--left\"\u003e\n \u003cspan class=\"stars staticStars notranslate\"\u003e\u003cspan size=\"12x12\" ... Is there a way to turn it into more readable code? Commented Nov 17, 2019 at 1:15
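The exchange above can be checked with a small sketch. It assumes a short, made-up sample (`raw`) shaped like the start of the response in the question, not the real response. It suggests that the backslashes are real characters in the body (JavaScript string escapes), which would explain why neither .decode() nor print() turns \u003c into <.

```python
# Hypothetical sample bytes, shaped like the start of the response in the question.
raw = b'Element.update("reviews", "\\n\\u003cdiv class=\\"smallText\\"\\u003e");'

# decode() only converts bytes to str; it does not interpret JavaScript escapes.
text = raw.decode("utf-8")

# The backslash is an actual character in the body, so print() still shows \u003c.
print(text)

has_literal_escape = "\\u003c" in text  # the six characters \ u 0 0 3 c are present
has_angle_bracket = "<" in text         # no real "<" exists until the escapes are interpreted
```

So an extra unescaping step is needed after decoding the bytes.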

1 Answer

The returned string looks like JavaScript (a call to Prototype's Element.update) that generates an HTML element from a string literal. You probably need to grab that string literal using the slice r.text[27:-2] and then call encode().decode('unicode_escape') on it to get a string that can be parsed by BeautifulSoup.

import urllib.parse
import requests
from bs4 import BeautifulSoup as Soup

URL = 'https://www.goodreads.com/book/reviews/4981?authenticity_token=vxZvklgqILI3SBwtJLDN5DicJKt93LiOWxYwFa%2BrWDdsJxTTAs46WvPN3L1PKNW3qpmacr%2BnWYXexhR%2BfoB3Cw%3D%3D&amp;amp;hide_last_page=true&amp;amp;language_code=en&amp;amp;page=4'

s = 'FUvf1v6N9TgtBKVmo5I3YLm3yVwb//WU9zZDdj1oWd3GeqSXpGnv0OmBZfbICi8zK7J3hdmEFJ9y5mcd7EN24Q=='

PARAMS = {'asynchronous':True, 
 'evalScripts':True, 
 'method':'get', 
 'parameters':'authenticity_token=' + urllib.parse.quote(s.encode("utf-8"))
}

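r = requests.get(url=URL, params=PARAMS)

# Skip the 27-character prefix 'Element.update("reviews", "' and the closing characters of the call,
# then turn the \uXXXX and \n escape sequences into real characters.
html_str = r.text[27:-2].encode().decode('unicode_escape')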
soup = Soup(html_str, "html.parser")
print(soup)
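As a possible alternative to the fixed slice and unicode_escape, here is a minimal sketch that pulls out the string literal with a regular expression and unescapes it with json.loads. It assumes the literal only uses escapes that are also valid JSON (\n, \" and \uXXXX, as in the output shown in the question). The helper name `extract_update_html` and the `sample` string are hypothetical and stand in for r.text. One reason to consider it: encode().decode('unicode_escape') garbles any raw non-ASCII characters in the response, while json.loads leaves them intact.

```python
import json
import re

def extract_update_html(js_text):
    """Return the HTML passed as the second argument of Element.update(...)."""
    # Capture the second double-quoted argument, allowing backslash escapes inside it.
    pattern = r'Element\.update\("[^"]*",\s*("(?:[^"\\]|\\.)*")'
    match = re.search(pattern, js_text, re.DOTALL)
    if match is None:
        raise ValueError("no Element.update(...) call found")
    # Assumption: the JavaScript literal is also a valid JSON string.
    return json.loads(match.group(1))

# Hypothetical stand-in for r.text, including one raw non-ASCII character.
sample = 'Element.update("reviews", "\\n\\u003cdiv class=\\"smallText\\"\\u003enaïve caf\\u00e9\\u003c/div\\u003e");'

html = extract_update_html(sample)
print(html)
```

The resulting string can then be passed to Soup(html, "html.parser") as in the answer's code.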

1 Comment

Thanks, this is exactly what I needed