
I am using Solr 3.3 to index data from my database, composing the JSON content in Python. I can upload 2126 records, which add up to 523,246 chars (approx. 511 KB). But when I try 2127 records, Python gives me this error:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "D:\Technovia\db_indexer\solr_update.py", line 69, in upload_service_details
    request_string.append(param_list)
  File "C:\Python27\lib\json\__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "C:\Python27\lib\json\encoder.py", line 203, in encode
    chunks = list(chunks)
  File "C:\Python27\lib\json\encoder.py", line 425, in _iterencode
    for chunk in _iterencode_list(o, _current_indent_level):
  File "C:\Python27\lib\json\encoder.py", line 326, in _iterencode_list
    for chunk in chunks:
  File "C:\Python27\lib\json\encoder.py", line 384, in _iterencode_dict
    yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 68: invalid start byte 

Ouch. Is 512 KB of bytes a fundamental limit? Is there a high-volume alternative to the existing JSON module?


Update: it's a fault in the data, as trying to encode *biz_list[2126:]* causes an immediate error. Here is the offending piece:

'2nd Floor, Gurumadhavendra Towers,\nKadavanthra Road, Kaloor,\nCochin \x96 682 017'

How can I make it encodable into JSON?
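For the record, such records can be hunted down mechanically rather than by bisecting slices. A minimal sketch, assuming biz_list is the list of dicts being serialized (the name is taken from the update above):

import json

# Try each record on its own; json.dumps assumes str values are UTF-8,
# so any record containing stray non-UTF-8 bytes raises UnicodeDecodeError.
for i, record in enumerate(biz_list):
    try:
        json.dumps(record)
    except UnicodeDecodeError:
        print i, record   # e.g. 2126 and the offending address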


Update 2: The answer worked as expected: the data came from a MySQL table with the "latin1_swedish_ci" collation. I saw significance in a random number. Sorry for spontaneously channeling the spirit of a headline writer when diagnosing the fault.
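A hedged sketch of the root-cause fix, assuming the MySQLdb (MySQL-python) driver; the connection parameters, table, and column names are placeholders. Asking the driver for unicode results means json.dumps never has to guess at a byte encoding:

import MySQLdb

# charset="latin1" matches the table's latin1_swedish_ci collation;
# use_unicode=True makes the driver return unicode objects instead of
# raw byte strings, so json.dumps can encode them directly.
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="mydb", charset="latin1", use_unicode=True)
cur = conn.cursor()
cur.execute("SELECT address FROM businesses")   # placeholder query
rows = [r[0] for r in cur.fetchall()]           # unicode values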

3 Comments

  • This has nothing to do with the size of your json. (Aug 22, 2011 at 10:23)
  • Did you read the error? You have a byte that isn't valid utf-8 in your data. This has nothing to do with the size. Show us the code and look at the data in that particular field. (Aug 22, 2011 at 10:26)
  • For a trivial counterexample, json.dumps({"megastring": "-" * 1000000}) produces a one-megabyte JSON object. (Aug 22, 2011 at 10:38)

1 Answer


Simple: just don't use the utf-8 encoding if your data is not in UTF-8:

>>> json.loads('["\x96"]')
....
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte

>>> json.loads('["\x96"]', encoding="latin-1")
[u'\x96']

From the json.loads documentation:

If s is a str instance and is encoded with an ASCII based encoding other than utf-8 (e.g. latin-1) then an appropriate encoding name must be specified. Encodings that are not ASCII based (such as UCS-2) are not allowed and should be decoded to unicode first.
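As a hedged illustration of that last sentence: a byte string in a non-ASCII-based encoding such as UTF-16 must be decoded to unicode before it reaches json.loads, since there is no encoding argument for it:

>>> s = u'["\x96"]'.encode("utf-16")
>>> json.loads(s.decode("utf-16"))
[u'\x96']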

Edit: To get the proper unicode value of "\x96", use "cp1252" as Eli Collins mentioned in the comments:

>>> json.loads('["\x96"]', encoding="cp1252")
[u'\u2013']
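The same applies in the encoding direction, which is the question's actual case: decode the str with cp1252 before handing it to json.dumps. A minimal sketch using the offending address from the question:

>>> addr = '2nd Floor, Gurumadhavendra Towers,\nKadavanthra Road, Kaloor,\nCochin \x96 682 017'
>>> json.dumps(addr.decode("cp1252"))
'"2nd Floor, Gurumadhavendra Towers,\\nKadavanthra Road, Kaloor,\\nCochin \\u2013 682 017"'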

3 Comments

Just to nitpick - latin-1 doesn't define a character for \x96, though the python codec will willingly decode it (but as a raw byte value, not a particular character). The correct codec is probably cp1252 (microsoft's extension of latin-1), which defines byte \x96 as unicode char 2012 (en-dash). Pretty much any ascii-looking encoding with a bunch of \x90-\x9F chars is likely to be cp1252, as windows systems generate these characters (smart quotes, etc) a lot.
@Eli, Thank you, updated the post. I got 2013 on my ubuntu though.
Doh. 2013 is what my console prints too. I need some coffee :)
