I'm pulling in some JSON data that looks something like this:

{
 "string":"â€¢ Christmas 2014 â€¢",
 "layer_id":490,
 "other": "attributes",
 "that_dont": "matter"
}

This JSON is being generated elsewhere and I'm pulling it in via an http request (using json.loads(request.text)).

When I print the string in my console, I get:

⢠Christmas 2014

(and an exceptions.UnicodeDecodeError if I call str() on it)

I'm printing the string on a PDF and need the string to literally be:

"\u00B7 Christmas 2014 \u00B7"

My instinct is a bit hacky: I just want to replace the series of strange characters with the proper Unicode code point, but I don't even know what it is that I'm looking to replace.

  • Why U+00B7 and not U+2022? That's the original content, in any case; • Christmas 2014 •. Commented Dec 5, 2014 at 15:53

1 Answer

Don't use response.text; that is what produces the Mojibake here. response.text may end up using the wrong codec if no character set was specified on the response.

Use response.json() instead, and let that handle the correct codec for your JSON.
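To make the mechanism concrete, here is a minimal sketch that needs no network access. The `raw_body` bytes are a stand-in I made up for the body of the HTTP response (assumed to be UTF-8 encoded JSON), and cp1252 is assumed as the wrong codec; neither comes from your actual server.

```python
import json

# Hypothetical stand-in for the raw response body: UTF-8 encoded JSON.
raw_body = u'{"string": "\u2022 Christmas 2014 \u2022", "layer_id": 490}'.encode("utf-8")

# Decoding the bytes with the wrong codec (cp1252 assumed here) yields Mojibake:
# each 3-byte UTF-8 bullet becomes three separate characters.
wrong = json.loads(raw_body.decode("cp1252"))["string"]

# Decoding the same bytes as UTF-8 preserves the U+2022 BULLET characters.
right = json.loads(raw_body.decode("utf-8"))["string"]

print(repr(wrong))
print(repr(right))
```

The same bytes give two different strings; only the codec choice differs. That is the decision response.json() is meant to get right for you.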

If you still see the same result, then the source used cp1252 to decode UTF-8 data and you need to revert that process:

corrected = broken.encode('cp1252').decode('utf8')

which fixes your specific issue:

>>> print(u"â€¢ Christmas 2014 â€¢".encode('cp1252').decode('utf8'))
• Christmas 2014 •

Those are U+2022 BULLET characters.
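If you apply this repair to many strings, some of which may already be correct, a guarded helper avoids raising on clean input. This is only a sketch: the function name `undo_cp1252_mojibake` is my own, and the final replacement with U+00B7 is an assumption based on what the question asks for, not something the repair itself requires.

```python
def undo_cp1252_mojibake(text):
    """Reverse a UTF-8-read-as-cp1252 mix-up; return text unchanged if it doesn't apply."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside cp1252, or the resulting
        # bytes are not valid UTF-8: assume the text was not Mojibake.
        return text


broken = u"\u00e2\u20ac\u00a2 Christmas 2014 \u00e2\u20ac\u00a2"  # the garbled form
fixed = undo_cp1252_mojibake(broken)

# Optional, assumed step: swap BULLET (U+2022) for MIDDLE DOT (U+00B7),
# in case the PDF really needs U+00B7 as stated in the question.
for_pdf = fixed.replace(u"\u2022", u"\u00b7")

print(repr(fixed))
print(repr(for_pdf))
```

Already-correct strings fail the round trip and are returned as-is, so calling the helper twice does no harm.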

You could also use the ftfy library, which can handle Mojibake untangling automatically for you:

>>> import ftfy
>>> print(ftfy.fix_text(u"â€¢ Christmas 2014 â€¢"))
• Christmas 2014 •