0

i have a problematic json string contains some funky unicode characters

"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}

and if I convert using python

import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
json.loads(s) 
# Error..

If I can accept to skip/lose the value of these unicode characters, what is the best way to make my json.loads(s) works?

2
  • It seems you are missing some slashes. When I add the slashes, it works just fine. In [11]: s = r'{"test":{"foo":"Ig0s\\x5C/k\\x5C/4jRk"}}' In [12]: json.loads(s) Out[12]: {u'test': {u'foo': u'Ig0s\\x5C/k\\x5C/4jRk'}} Commented Apr 26, 2014 at 8:21
  • @shiplu.mokadd.im: Yeah, sure, then you just escaped the escapes. That's not the point though, is it. Commented Apr 26, 2014 at 9:02

3 Answers 3

1

If the rest of the string apart from invalid \x5c is a JSON then you could use string-escape encoding to decode `'\x5c into backslashes:

>>> import json
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> json.loads(s.decode('string-escape')) 
{u'test': {u'foo': u'Ig0s/k/4jRk'}}
Sign up to request clarification or add additional context in comments.

Comments

1

You don't have JSON; that can be interpreted directly as Python instead. Use ast.literal_eval():

>>> import ast
>>> s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
>>> ast.literal_eval(s)
{'test': {'foo': 'Ig0s\\/k\\/4jRk'}}

The \x5C is a single backslash, doubled in the Python literal string representation here. The actual string value is:

>>> print _['test']['foo']
Ig0s\/k\/4jRk

This parses the input as Python source, but only allows for literal values; strings, None, True, False, numbers and containers (lists, tuples, dictionaries).

This method is slower than json.loads() because it does part of the parse-tree processing in pure Python code.

Another approach would be to use a regular expression to replace the \xhh escape codes with JSON \uhhhh codes:

import re

escape_sequence = re.compile(r'\\x([a-fA-F0-9]{2})')

def repair(string):
    return escape_sequence.sub(r'\\u00\1', string)

Demo:

>>> import json
>>> json.loads(repair(s))
{u'test': {u'foo': u'Ig0s\\/k\\/4jRk'}}

If you can repair the source producing this value to output actual JSON instead that'd be a much better solution.

1 Comment

ast.literal_eval() might fail on JSON literals such as true, false, null. The regex may fail if x.. is escaped by even number of backslashes. You could use 'string-escape' encoding instead
0

I'm a bit late for the party, but we were seeing a similar issue, to be precise this one Logstash JSON input with escaped double quote, just for \xXX.

There JS.stringify created such (per specification) invalid json texts.

The solution is to simply replace the \x by \u00, as unicode escaped characters are allowed, while ASCII escaped characters are not.

import json
s = r'{"test":{"foo":"Ig0s\x5C/k\x5C/4jRk"}}'
s = s.replace("\\x", "\\u00")
json.loads(s)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.