Python: handle broken unicode bytes when parsing JSON string

Question

My code fetches some content from a UserVoice site. As you might know, UserVoice is a shitty piece of software that can't handle data correctly: to reduce the amount of text on the search page, they cut the text at, let's say, 300 characters and then append "..." to the end. The thing is, they don't care about cutting in the middle of a multi-byte character, which leaves a partial UTF-8 sequence: e.g. for the è character, I get \xc3 instead of \xc3\xa8.

Of course, when I give this horrible soup to json.loads, it fails with UnicodeDecodeError. So my question is simple: how can I ask json.loads to ignore these bad bytes, as I would do using .decode('utf-8', 'ignore') if I had access to the internals of the function?

Thanks.

Lachezar · Accepted Answer · 2011-11-02 17:18:33Z

13

You don't ask simplejson to ignore them. When I had a similar problem, I just ran .decode('utf-8', 'ignore').encode('utf-8') on the input and proceeded.
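As a rough sketch of that round trip on Python 3.6+ (where json.loads also accepts bytes): sanitize_utf8 is a hypothetical helper name and the sample payload is made up for illustration, neither comes from UserVoice or from the json module.

```python
import json

def sanitize_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: drop bytes that are not valid UTF-8."""
    # 'ignore' silently discards undecodable bytes; encoding the
    # resulting text again yields a clean UTF-8 byte string.
    return raw.decode('utf-8', 'ignore').encode('utf-8')

# Made-up payload: "è" is b'\xc3\xa8', but truncation cut off the second byte.
raw = b'["qualit\xc3..."]'

clean = sanitize_utf8(raw)
data = json.loads(clean)

print(clean)
print(data)
```

Note that 'ignore' drops the damaged character without a trace, which is usually fine for text that was already truncated.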


1 Comment

zopieux Over a year ago

OK, I was just writing an answer saying I could simply decode the string before passing it to json.loads. Thanks, it obviously works!

jfs · Accepted Answer · 2011-11-02 17:53:35Z

11

Just pass a Unicode string to json.loads() (Python 2 session):

>>> badstr = "qualité"[:-1]+".."
>>> badstr
'qualit\xc3..'
>>> json_str = '["%s"]' % badstr
>>> import json
>>> json.loads(json_str)
Traceback (most recent call last):
 ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 6: invalid \
continuation byte
>>> json.loads(json_str.decode('utf-8','ignore'))
[u'qualit..']
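For readers on Python 3, the same experiment might look roughly like the sketch below. It assumes Python 3.6 or later, where json.loads accepts bytes and decodes them strictly. The variable names are illustrative, and the 'replace' variant is an addition that is not part of the session above.

```python
import json

# "qualité" encoded as UTF-8, with the last byte of "é" (b'\xc3\xa9') cut off.
bad = "qualit\u00e9".encode('utf-8')[:-1] + b'..'
json_bytes = b'["' + bad + b'"]'

# Passing the raw bytes fails, because json.loads decodes them strictly.
try:
    json.loads(json_bytes)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Decoding first with 'ignore' drops the stray byte...
ignored = json.loads(json_bytes.decode('utf-8', 'ignore'))

# ...while 'replace' keeps a visible U+FFFD marker where the damage was.
replaced = json.loads(json_bytes.decode('utf-8', 'replace'))

print(strict_failed, ignored, replaced)
```

Whether to drop the broken character or mark it with U+FFFD depends on whether later code needs to know that the text was damaged.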


2 Comments

mcont Over a year ago

The answer from @Lucho includes an additional .encode, is it needed?

jfs Over a year ago

@Matteo: no. The JSON format is defined for Unicode text, so the .encode() after the .decode() is not necessary.
