My code fetches some content from a UserVoice site. As you might know, UserVoice is a shitty piece of software that can't handle data correctly: to reduce the amount of text on the search page, it cuts the text at, let's say, 300 characters and then appends "..." to the end. The problem is that it doesn't care about cutting in the middle of a multi-byte character, which leaves a partial UTF-8 sequence: e.g. for the è character, I get \xc3 instead of \xc3\xa8.

Of course, when I give this horrible soup to json.loads, it fails with UnicodeDecodeError. So my question is simple: how can I ask json.loads to ignore these bad bytes, as I would do using .decode('utf-8', 'ignore') if I had access to the internals of the function?

Thanks.

2 Answers

You don't ask simplejson to ignore them. When I had a problem similar to yours, I just ran .decode('utf-8', 'ignore').encode('utf-8') on the input and proceeded.
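As a minimal sketch of that round trip, assuming Python 3.6 or later (where `json.loads` also accepts `bytes`): the `raw` payload below is a made-up example of a string cut in the middle of a two-byte character, not actual UserVoice output.

```python
import json

# Hypothetical payload: a JSON array whose string was truncated mid-character,
# leaving a lone 0xC3 lead byte before the trailing dots.
raw = b'["qualit\xc3..."]'

# Decode leniently (bytes that are not valid UTF-8 are dropped),
# then re-encode so the result is clean UTF-8 bytes again.
cleaned = raw.decode("utf-8", "ignore").encode("utf-8")

data = json.loads(cleaned)
print(data)  # ['qualit...']
```

Note that the stray byte is silently discarded, so the truncated character is lost rather than repaired.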

1 Comment

OK, I was just writing an answer saying that I could simply decode the string before passing it to json.loads. Thanks, it obviously works!

Just pass a Unicode string to json.loads():

>>> badstr = "qualité"[:-1]+".."
>>> badstr
'qualit\xc3..'
>>> json_str = '["%s"]' % badstr
>>> import json
>>> json.loads(json_str)
Traceback (most recent call last):
 ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 6: invalid \
continuation byte
>>> json.loads(json_str.decode('utf-8','ignore'))
[u'qualit..']

2 Comments

The answer from @Lucho includes an additional .encode(); is it needed?
@Matteo: no. The JSON format is defined in terms of Unicode text, so the .encode() after the .decode() is not necessary.
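To illustrate the point about the extra .encode(), here is a small sketch, assuming Python 3.6 or later (where `json.loads` accepts both `str` and `bytes`). The `raw` payload is a hypothetical stand-in for the truncated input; parsing the decoded text directly gives the same result as parsing the re-encoded bytes.

```python
import json

# Hypothetical truncated payload with a dangling 0xC3 byte.
raw = b'["qualit\xc3..."]'

# Lenient decode: the invalid byte is dropped, yielding ordinary text.
text = raw.decode("utf-8", "ignore")

from_text = json.loads(text)                    # parse the Unicode string directly
from_bytes = json.loads(text.encode("utf-8"))   # parse after re-encoding to UTF-8

print(from_text == from_bytes)  # True
```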
