2

I'm trying to get the valid python list from the response of a server like you can see below:

window.__search.list=[{"order":"1","base":"LAW","n":"148904","access":{"css":"avail_yes","title":"\u042 2\u0435\u043a\u0441\u0442\u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442\u0430\u0434\u043e\u0441\u0442\u0443\u043f\u0435\u043d"},"title":"\"\u0410\u0440\u0431\u0438\u0442\u0440\u0430\u0436\u043d\u044b\u0439\u043f\u0440\u043e\u0446\u0435\u0441\u0441\u0443\u0430\u043b\u044c\u043d\u044b\u0439\u043a\u043e\u0434\u0435\u043a\u0441\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u043e\u0439\u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u0438\" \u043e\u0442 24.07.2002 N 95-\u0424\u0417 (\u0440\u0435\u0434. \u043e\u0442 02.07.2013) (\u0441 \u0438\u0437\u043c. \u0438 \u0434\u043e\u043f.,\u0432\u0441\u0442\u0443\u043f\u0430 \u044e\u0449\u0438\u043c\u0438\u0432 \u0441\u0438\u043b\u0443 \u0441 01.08.2013)"}, ... }];

I did it through cutting off "window.__search.list=" and ";" from the string using data = json.loads(re.search(r"(?=\[)(.*?)\s*(?=\;)", url).group(1)) and then it was looked like standard JSON:

[{u'access': {u'css': u'avail_yes', u'title': u'\u0422\u0435\u043a\u0441\u0442\u0434\u043e\u043a\u04 43\u043c\u0435\u043d\u0442\u0430 \u0434\u043e\u0441\u0442\u0443\u043f\u0435\u043d'},u'title': u'"\u0410\u0440\u0431\u0438\u0442\u0440\u0430\u0436\u043d\u044b\u0439\u043f\u0440\u043e\u0446\u0435\u0441\u0441\u0443\u0430\u043b\u044c\u043d\u044b\u0439\u043a\u043e\u0434\u0435\u043a\u0441\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u043e\u0439\u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u0438" \u043e\u0442 24.07.2002 N 95-\u0424\u0417 (\u04 40\u0435\u0434. \u043e\u0442 02.07.2013) (\u0441 \u0438\u0437\u043c. \u0438 \u0434\u043e \u043f.,\u0432\u0441\u0442\u0443\u043f\u0430\u044e\u0449\u0438\u043c\u0438 \u0432 \u0441 \u0438\u043b\u0443 \u0441 01.08.2013)', u'base': u'LAW', u'order': u'1', u'n': u'148904'}, ... }]

But sometimes, during iterating an others urls I get an error like this:

File "/Developer/Python/test.py", line 123, in order_search
    data = json.loads(re.search(r"(?=\[)(.*?)\s*(?=\;)", url).group(1))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \uXXXX escape: line 1 column 20235 (char 20235)

How can I fix it, or maybe there's an another way to get valid JSON (desirable using native libraries)?

2
  • Actually, those XXXX in the error message you get would interest me. Commented Sep 4, 2013 at 13:52
  • 1
    Ah, I see, the XXXX was verbatim (now reproduced that error message). You should have a look at the given point in the input string then: try: json.loads(x) except ValueError as problem: print x[int(problem.args[0].split(' ')[-1][:-1])-5:][:30] Commented Sep 4, 2013 at 14:01

2 Answers 2

3

Probably, your regular expression has found char ';' somewhere in the middle of a response, and because of this you get an error, because, using your regular expression, you might have received an incomplete, cropped response, and that's why you could not convert it into JSON.

Yes, I agree with user RickyA that sometimes using a native tools, a code will easier to read than trying to make up RegEx. But here, I'd rather to use exactly regular expression, something like this:

data = re.search(r'(?=\[)(.*?)[\;]*$', response).group(1)
/(?=\[)(.*?)[\;]*$/
(?=\[) Positive Lookahead
\[ Literal [
1st Capturing group (.*?)
. 0 to infinite times [lazy] Any character (except newline)
Char class [\;] 0 to infinite times [greedy] matches:
\; The character ;
$ End of string

I believe you meant that the variable 'url' means a response from a server, then maybe better to use name of variable 'response' instead of 'url'.

And, if you've some troubles with using RegEx, I advise you to use an editor of regular expressions, like RegEx 101.This is the online regular expression editor, which explains each block of inputted expression.

Sign up to request clarification or add additional context in comments.

Comments

2

What about:

response = response.strip() #get rid of whitespaces
response = response[response.find("["):] #trim everything before the first '['
if response[-1:] == ";": #if last char == ";"
    response = response[:-1] #trim it

Seems like a big overkill to do this with regex.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.