I am using Solr 3.3 to index stuff from my database. I compose the JSON content in Python. I manage to upload 2126 records, which add up to 523246 chars (approx 511kb). But when I try 2127 records, Python gives me the error:

    Traceback (most recent call last):
      File "<console>", line 1, in <module>
      File "D:\Technovia\db_indexer\solr_update.py", line 69, in upload_service_details
        request_string.append(param_list)
      File "C:\Python27\lib\json\__init__.py", line 238, in dumps
        **kw).encode(obj)
      File "C:\Python27\lib\json\encoder.py", line 203, in encode
        chunks = list(chunks)
      File "C:\Python27\lib\json\encoder.py", line 425, in _iterencode
        for chunk in _iterencode_list(o, _current_indent_level):
      File "C:\Python27\lib\json\encoder.py", line 326, in _iterencode_list
        for chunk in chunks:
      File "C:\Python27\lib\json\encoder.py", line 384, in _iterencode_dict
        yield _encoder(value)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 68: invalid start byte
Ouch. Is 512kb worth of bytes a fundamental limit? Is there any high-volume alternative to the existing JSON module?
Update: it's a fault in the data, since trying to encode *biz_list[2126:]* causes an immediate error. Here is the offending piece:

    '2nd Floor, Gurumadhavendra Towers,\nKadavanthra Road, Kaloor,\nCochin \x96 682 017'
How can I make it encodable as JSON?
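For what it's worth, here is a minimal sketch under Python 2.7 of how that value can be made encodable, assuming the bytes are Windows-1252 (which is what MySQL's latin1 character set actually maps to):

    import json

    # The offending value, exactly as it comes out of the database.
    raw = '2nd Floor, Gurumadhavendra Towers,\nKadavanthra Road, Kaloor,\nCochin \x96 682 017'

    # 0x96 is an en dash in Windows-1252, so decode to unicode before dumping.
    print(json.dumps({'address': raw.decode('cp1252')}))

    # If the source encoding is uncertain, 'replace' at least keeps the upload running.
    print(json.dumps({'address': raw.decode('cp1252', 'replace')}))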
Update 2: The answer worked as expected: the data came from a MySQL table encoded as "latin1_swedish_ci". I had been reading significance into a random number (the ~512 KB total). Sorry for spontaneously channeling the spirit of a headline writer when diagnosing the fault.
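Given that root cause, one way to apply the fix is to decode every text column as the rows are fetched. This is only a sketch: decode_row and biz_list are illustrative names, and cp1252 is assumed because MySQL's latin1 is effectively Windows-1252.

    def decode_row(row, encoding='cp1252'):
        # Turn byte-string columns into unicode so json.dumps never has to guess.
        return [col.decode(encoding, 'replace') if isinstance(col, str) else col
                for col in row]

    # e.g. biz_list = [decode_row(r) for r in cursor.fetchall()]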
The problem is a byte that is not valid utf-8 in your data; this has nothing to do with the size. Show us the code, and look at the data in that particular field. `json.dumps({"megastring": "-" * 1000000})` produces a one-megabyte JSON object.
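Both halves of that claim are easy to check under Python 2.7: the one-megabyte value encodes fine, while a single stray byte reproduces the traceback above.

    import json

    # Size is not the limit: a one-megabyte ASCII value encodes without complaint.
    print(len(json.dumps({"megastring": "-" * 1000000})))

    # One byte that is not valid UTF-8 is enough to trigger the error.
    try:
        json.dumps({"address": "Cochin \x96 682 017"})
    except UnicodeDecodeError as exc:
        print(exc)  # 'utf8' codec can't decode byte 0x96 ...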