Parse JSON containing "\u as UTF-8 bytes" in Python

Question

I have a JSON file from Facebook's "Download your data" feature. Instead of escaping Unicode characters by their code point number, it escapes them as a sequence of UTF-8 bytes.

For example, the letter á (U+00E1) is escaped in the JSON file as \u00c3\u00a1 instead of \u00e1. 0xC3 0xA1 is the UTF-8 encoding of U+00E1.

The json library in Python 3 decodes it as Ã¡, which corresponds to the two code points U+00C3 and U+00A1.
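A minimal reproduction of the behaviour described above may help. It is only a sketch: the one-value JSON document is made up for illustration, and a raw string is used so that the JSON parser sees the backslash escapes the way it would when reading them from a file.

```python
import json

# Raw string: the backslashes reach the JSON parser untouched,
# just as they would when the text comes from a file.
raw = r'"\u00c3\u00a1"'
decoded = json.loads(raw)

print(decoded)                         # Ã¡
print([hex(ord(c)) for c in decoded])  # ['0xc3', '0xa1']
```

The parser is doing what the JSON notation asks for: each \u00XX escape becomes one code point, so two characters come out instead of one.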

Is there a way to parse such a file correctly (so that I get the letter á) in Python?

λuser · Accepted Answer · 2018-05-13 13:25:27Z

3

It seems they encoded their Unicode strings to bytes using UTF-8 and then wrote each byte into the JSON as a separate \u00XX escape, as if every byte were a code point. This is incorrect behaviour on their part.

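Encoding the parsed string as Latin-1 maps each code point in the range U+0000 to U+00FF back to the byte with the same value, and decoding those bytes as UTF-8 recovers the intended text. Python 3 example: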

>>> '\u00c3\u00a1'.encode('latin1').decode('utf-8')
'á'

You need to parse the JSON and walk the entire data to fix it:

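import json

def visit_list(l):
    return [visit(item) for item in l]

def visit_dict(d):
    # Keys are strings too, so they need the same fix as values.
    return {visit(k): visit(v) for k, v in d.items()}

def visit_str(s):
    # Code points -> bytes of the same value -> real UTF-8 decoding.
    return s.encode('latin1').decode('utf-8')

def visit(node):
    funcs = {
        list: visit_list,
        dict: visit_dict,
        str: visit_str,
    }
    func = funcs.get(type(node))
    if func:
        return func(node)
    else:
        # Numbers, booleans and None are returned unchanged.
        return node

# Raw string, so the JSON parser (not Python) interprets the \u escapes,
# exactly as it would for text read from a file.
incorrect = r'{"foo": ["\u00c3\u00a1", 123, true]}'
correct_obj = visit(json.loads(incorrect))
print(correct_obj)  # {'foo': ['á', 123, True]}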
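One caveat with the approach above: encode('latin1') raises UnicodeEncodeError for any character above U+00FF, and decode('utf-8') raises UnicodeDecodeError if the bytes are not valid UTF-8. The following is a hedged sketch of a more tolerant variant. The names fix_text and fix are hypothetical, and leaving non-matching strings untouched is an assumption about what is wanted, not something the export format guarantees.

```python
import json

def fix_text(s):
    """Undo the byte-per-escape mojibake; return s unchanged if it does not fit."""
    try:
        return s.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Assumption: a string that cannot round-trip was already correct.
        return s

def fix(node):
    """Recursively repair every string (keys included) in parsed JSON data."""
    if isinstance(node, str):
        return fix_text(node)
    if isinstance(node, list):
        return [fix(item) for item in node]
    if isinstance(node, dict):
        return {fix(k): fix(v) for k, v in node.items()}
    return node  # numbers, booleans, None

# Made-up sample: a broken key, a broken value, and a correctly escaped euro sign.
raw = r'{"caf\u00c3\u00a9": ["\u00c3\u00a1", "\u20ac", 123, true]}'
result = fix(json.loads(raw))
print(result)  # {'café': ['á', '€', 123, True]}
```

The trade-off is that a genuinely intended string such as "Ã¡" would also be converted, so this is only safe when the whole file is known to use the broken escaping.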

edited May 13, 2018 at 13:25

answered May 13, 2018 at 11:58

λuser


4 Comments

user6247 Over a year ago

This doesn't work if the string is loaded from a file; in that case the string is '\\u00c3\\u00a1'.

λuser Over a year ago

what do you mean "it doesn't work"? does the string contain literal backslashes? if it does, you can use json.loads on them to parse this notation, or alternatively ast.literal_eval.

user6247 Over a year ago

The \u00c3\u00a1 I mentioned in the question is how it's saved in the JSON file itself. That means when I look at the file in a text editor, I see exactly that (Python wasn't involved at that point). So yes, both the file and the string loaded from it contain literal backslashes. But when I use json.loads, it's parsed incorrectly into Ã¡.

λuser Over a year ago

you need to fix the whole data after parsing. i just edited my answer to add code to show how to fix it.
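To illustrate the file case discussed in these comments, here is a small sketch under stated assumptions: the file name export.json and its contents are made up, and fix is a hypothetical helper that applies the same Latin-1 to UTF-8 round trip as the answer. The file holds literal backslashes, json.loads turns them into Ã¡, and the repair is applied after parsing.

```python
import json
import os
import tempfile

def fix(node):
    """Recursively re-decode strings that were escaped one UTF-8 byte at a time."""
    if isinstance(node, str):
        return node.encode('latin1').decode('utf-8')
    if isinstance(node, list):
        return [fix(item) for item in node]
    if isinstance(node, dict):
        return {fix(k): fix(v) for k, v in node.items()}
    return node

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "export.json")  # hypothetical file name
    # Write literal backslash escapes, as seen in a text editor.
    with open(path, "w", encoding="utf-8") as f:
        f.write(r'{"name": "\u00c3\u00a1"}')
    with open(path, encoding="utf-8") as f:
        text = f.read()

parsed = json.loads(text)
print(text)         # {"name": "\u00c3\u00a1"}
print(parsed)       # {'name': 'Ã¡'}
print(fix(parsed))  # {'name': 'á'}
```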
