Get a valid Python list from a string (JavaScript array)

Question

I'm trying to get a valid Python list from a server response like the one below:

window.__search.list=[{"order":"1","base":"LAW","n":"148904","access":{"css":"avail_yes","title":"\u0422\u0435\u043a\u0441\u0442\u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442\u0430\u0434\u043e\u0441\u0442\u0443\u043f\u0435\u043d"},"title":"\"\u0410\u0440\u0431\u0438\u0442\u0440\u0430\u0436\u043d\u044b\u0439\u043f\u0440\u043e\u0446\u0435\u0441\u0441\u0443\u0430\u043b\u044c\u043d\u044b\u0439\u043a\u043e\u0434\u0435\u043a\u0441\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u043e\u0439\u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u0438\" \u043e\u0442 24.07.2002 N 95-\u0424\u0417 (\u0440\u0435\u0434. \u043e\u0442 02.07.2013) (\u0441 \u0438\u0437\u043c. \u0438 \u0434\u043e\u043f.,\u0432\u0441\u0442\u0443\u043f\u0430 \u044e\u0449\u0438\u043c\u0438\u0432 \u0441\u0438\u043b\u0443 \u0441 01.08.2013)"}, ... }];

I did it by cutting "window.__search.list=" and ";" off the string using data = json.loads(re.search(r"(?=\[)(.*?)\s*(?=\;)", url).group(1)), and the result then looked like standard JSON:

[{u'access': {u'css': u'avail_yes', u'title': u'\u0422\u0435\u043a\u0441\u0442\u0434\u043e\u043a\u0443\u043c\u0435\u043d\u0442\u0430 \u0434\u043e\u0441\u0442\u0443\u043f\u0435\u043d'},u'title': u'"\u0410\u0440\u0431\u0438\u0442\u0440\u0430\u0436\u043d\u044b\u0439\u043f\u0440\u043e\u0446\u0435\u0441\u0441\u0443\u0430\u043b\u044c\u043d\u044b\u0439\u043a\u043e\u0434\u0435\u043a\u0441\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u043e\u0439\u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u0438" \u043e\u0442 24.07.2002 N 95-\u0424\u0417 (\u0440\u0435\u0434. \u043e\u0442 02.07.2013) (\u0441 \u0438\u0437\u043c. \u0438 \u0434\u043e \u043f.,\u0432\u0441\u0442\u0443\u043f\u0430\u044e\u0449\u0438\u043c\u0438 \u0432 \u0441 \u0438\u043b\u0443 \u0441 01.08.2013)', u'base': u'LAW', u'order': u'1', u'n': u'148904'}, ... }]

But sometimes, while iterating over other URLs, I get an error like this:

File "/Developer/Python/test.py", line 123, in order_search
    data = json.loads(re.search(r"(?=\[)(.*?)\s*(?=\;)", url).group(1))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \uXXXX escape: line 1 column 20235 (char 20235)

How can I fix it, or is there another way to get valid JSON (preferably using native libraries)?

Actually, I would be interested in those XXXX in the error message you get.
– Alfe, Commented Sep 4, 2013 at 13:52
Ah, I see, the XXXX was verbatim (I have now reproduced that error message). You should then have a look at the given point in the input string: try: json.loads(x) except ValueError as problem: print x[int(problem.args[0].split(' ')[-1][:-1])-5:][:30]
– Alfe, Commented Sep 4, 2013 at 14:01
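
The check suggested in the comment above can be sketched roughly as follows. This is a minimal sketch that assumes Python 3, where json.JSONDecodeError (a subclass of ValueError) exposes the failing offset as `pos`, so the message does not have to be parsed. The function name `show_json_error_context` and the sample string are hypothetical, not from the question.

```python
import json


def show_json_error_context(text, width=30):
    """Return the text around the point where json.loads fails, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as problem:
        # Start a few characters before the reported offset, like the comment does.
        start = max(problem.pos - 5, 0)
        return text[start:start + width]
    return None


# Hypothetical input with a broken \u escape ('Z' is not a hex digit).
broken = '[{"title": "\\u04ZZ"}]'
context = show_json_error_context(broken)
print(context)
```

Printing the returned slice for a failing response should show what the input actually looks like near the reported column.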

score 3 · Accepted Answer · 2013-09-08 15:32:05Z

Your regular expression has probably found a ';' character somewhere in the middle of a response. Because the lazy group stops at the first ';', you end up with an incomplete, cropped response, and that is why it cannot be converted into JSON.

Yes, I agree with RickyA that code using native tools is sometimes easier to read than a hand-crafted regex. But here I would still use a regular expression, something like this:

data = re.search(r'(?=\[)(.*?)[\;]*$', response).group(1)

/(?=\[)(.*?)[\;]*$/
(?=\[) Positive Lookahead
\[ Literal [
1st Capturing group (.*?)
. 0 to infinite times [lazy] Any character (except newline)
Char class [\;] 0 to infinite times [greedy] matches:
\; The character ;
$ End of string
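
To illustrate the difference between the two patterns, here is a small sketch, assuming Python 3. The `response` string is a made-up example that contains a ';' inside a string value; it is not taken from the real server output.

```python
import json
import re

# Hypothetical response: note the ';' inside the "title" value.
response = 'window.__search.list=[{"title": "a; b", "n": "1"}];'

# Pattern from the question: the lazy group stops before the FIRST ';'.
cropped = re.search(r"(?=\[)(.*?)\s*(?=\;)", response).group(1)

# Pattern anchored at the end of the string: only trailing ';' are left out.
complete = re.search(r"(?=\[)(.*?)[\;]*$", response).group(1)

data = json.loads(complete)
print(cropped)
print(data)
```

Here `cropped` ends in the middle of the string value, so json.loads would reject it, while `complete` parses. The exact error message for a cropped input depends on where the cut falls. Note also that '.' does not match a newline, so a response spanning several lines would need the re.DOTALL flag.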

I assume the variable 'url' holds the response from the server; in that case it would be better to name it 'response' instead of 'url'.

And if you have trouble with regular expressions, I advise you to use a regular expression editor such as RegEx 101. It is an online editor that explains each part of the expression you enter.

RickyA · Accepted Answer · 2013-09-04 13:55:51Z

2

What about:

response = response.strip()  # remove surrounding whitespace
response = response[response.find("["):]  # drop everything before the first '['
if response.endswith(";"):  # a trailing ';' is not part of the JSON
    response = response[:-1]  # drop it

Doing this with a regex seems like overkill.
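
Put together with the JSON parsing step, the string-method approach might look like the sketch below. It assumes Python 3; the function name `parse_js_array` and the shortened sample response are hypothetical, and the explicit check for a missing '[' is an addition that is not part of the snippet above.

```python
import json


def parse_js_array(response):
    """Extract the array literal from 'name=[...];' and parse it as JSON."""
    text = response.strip()
    start = text.find("[")
    if start == -1:
        # str.find returns -1 when there is no '[' at all.
        raise ValueError("no '[' found in response")
    text = text[start:]
    if text.endswith(";"):
        text = text[:-1]
    return json.loads(text)


# Shortened, hypothetical sample in the same shape as the server response.
sample = 'window.__search.list=[{"order": "1", "base": "LAW", "n": "148904"}];\n'
items = parse_js_array(sample)
print(items[0]["n"])
```

Because only the very last character is inspected, a ';' inside a string value does not cut the data short.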

