I have a JSON file from Facebook's "Download your data" feature. Instead of escaping Unicode characters by their code point number, it escapes them as a sequence of UTF-8 bytes.

For example, the letter á (U+00E1) is escaped in the JSON file as \u00c3\u00a1 instead of \u00e1. 0xC3 0xA1 is the UTF-8 encoding of U+00E1.

The json library in Python 3 decodes it as Ã¡, which corresponds to U+00C3 and U+00A1.

Is there a way to parse such a file correctly (so that I get the letter á) in Python?

1 Answer

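It seems they encoded their Unicode string into bytes using UTF-8, then wrote each byte into the JSON as if it were a code point. This is very bad behaviour on their part. Because Latin-1 maps code points 0-255 one-to-one onto byte values, encoding the wrongly decoded string as latin1 recovers the original bytes, which can then be decoded as UTF-8.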

Python 3 example:

>>> '\u00c3\u00a1'.encode('latin1').decode('utf-8')
'á'
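
To see why this works, here is a minimal sketch that reproduces the round trip. It assumes the exporter did what the file suggests (UTF-8 bytes written out as \u00XX escapes); the variable names are only for illustration.

```python
import json

# What the exporter apparently did (an assumption inferred from the file):
original = "á"                          # U+00E1
utf8_bytes = original.encode("utf-8")   # two bytes: 0xC3 0xA1

# Each byte ended up in the file as a \u00XX escape, so json reads two code points.
mojibake = json.loads(r'"\u00c3\u00a1"')
assert mojibake == utf8_bytes.decode("latin1")

# Latin-1 maps code points 0-255 to the byte with the same value,
# so encoding with it recovers the original UTF-8 bytes.
repaired = mojibake.encode("latin1").decode("utf-8")
print(repaired == original)
```

The raw string literal (r'...') matters here: it keeps the backslashes literal, which is how the text looks inside the file.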

You need to parse the JSON and then walk the entire resulting data structure, fixing every string (dictionary keys included):

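import json

def visit_list(l):
    return [visit(item) for item in l]

def visit_dict(d):
    # JSON object keys are strings too, so they need the same repair
    return {visit(k): visit(v) for k, v in d.items()}

def visit_str(s):
    # Code points 0-255 map back to the original bytes, which are then decoded as UTF-8
    return s.encode('latin1').decode('utf-8')

def visit(node):
    # Dispatch on the exact type; numbers, booleans and None are returned unchanged
    funcs = {
        list: visit_list,
        dict: visit_dict,
        str: visit_str,
    }
    func = funcs.get(type(node))
    if func:
        return func(node)
    else:
        return node

# Raw string, so the backslash escapes reach json.loads exactly as they appear in the file
incorrect = r'{"foo": ["\u00c3\u00a1", 123, true]}'
correct_obj = visit(json.loads(incorrect))
# correct_obj == {'foo': ['á', 123, True]}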
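
Note that visit_str raises UnicodeEncodeError if a string contains a character above U+00FF, and UnicodeDecodeError if the recovered bytes are not valid UTF-8. If the export might mix correctly escaped strings with broken ones (I have not verified whether it does), a more forgiving variant could look like the sketch below. The names fix_mojibake, fix_tree and load_fixed are made up for this example, and reading the file as UTF-8 is an assumption.

```python
import json

def fix_mojibake(s):
    """Re-decode a str whose code points are really UTF-8 bytes; leave others alone."""
    try:
        return s.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as bytes, or not valid UTF-8: assume it was already correct
        return s

def fix_tree(node):
    """Recursively repair every string (including dict keys) in parsed JSON data."""
    if isinstance(node, str):
        return fix_mojibake(node)
    if isinstance(node, list):
        return [fix_tree(item) for item in node]
    if isinstance(node, dict):
        return {fix_tree(k): fix_tree(v) for k, v in node.items()}
    return node  # numbers, booleans, None

def load_fixed(path):
    """Parse a JSON file and repair its strings (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return fix_tree(json.load(f))
```

The trade-off is that a string which fails to convert is silently kept as-is, so a genuinely corrupted value would go unnoticed; the strict version above fails loudly instead.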
4 Comments

This doesn't work if the string is loaded from a file, the string in that case is '\\u00c3\\u00a1'.
what do you mean "it doesn't work"? does the string contain literal backslashes? if it does, you can use json.loads on them to parse this notation, or alternatively ast.literal_eval.
The \u00c3\u00a1 I mentioned in the question is how it's saved in the JSON file itself, that means when I look at the file in a text editor, I see exactly that (Python wasn't involved at that point). So yes, both the file and the string loaded from it contain literal backslashes. But when I use json.loads it's parsed incorrectly into Ã¡.
you need to fix the whole data after parsing. i just edited my answer to add code to show how to fix it.