I have read multiple StackOverflow articles on this and most of the top 10 Google results. Where my issue deviates is that I am using one Python script to create my JSON files, and the next script, run not ten minutes later, can't read one of those very files.
Short version: I generate leads for my online business, and I am attempting to learn Python in order to have better analytics on those leads. I am scouring two years' worth of leads with the intent of retaining the useful data and dropping anything personal - email addresses, names, etc. - while also saving the 30,000+ leads into a few dozen files for easy access.
So my first script opens every single individual lead file - 30,000+ of them - and determines the date each lead was captured based on a timestamp in the file. It then saves that lead under the appropriate key in a dict. When all the data has been aggregated into this dict, text files are written out using json.dumps.
The dict's structure is:
addData['lead']['July_2013'] = { ... }
where the 'lead' key can be lead, partial, and a few others, and the 'July_2013' key is obviously a date-based key that can be any full month name combined with 2013 or 2014, going back to 'February_2013'.
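For illustration, the aggregated structure ends up looking something like this (the timestamps and the fields inside each lead are made up here, since I can't post real data):

addData = {
    'lead': {
        'July_2013': {
            1374238335.0: {'someField': 'someValue'},    # made-up field names
            1374241122.0: {'someField': 'anotherValue'}
        },
        'August_2013': {}    # ...and so on for each month
    },
    'partial': {}            # same layout for the other extensions
}

Each innermost dict is keyed by the lead's timestamp and holds the remaining (non-personal) fields of that lead.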
The full error is this:
ValueError: Unterminated string starting at: line 1 column 9997847 (char 9997846)
But I've manually looked at the file and my IDE says there are only 76,655 chars in the file. So how did it get to 9997846?
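One thing I plan to check is whether the IDE is even showing me the whole file, i.e. compare the size on disk against what a plain read returns (quick sketch; the path is just a placeholder for the failing file):

import os

failing = 'D://path//to//the//failing//file.cd.lead.agg'  # placeholder path

print "Size on disk: {} bytes".format(os.path.getsize(failing))

with open(failing, 'rb') as fh:
    raw = fh.read()
print "Characters read by Python: {}".format(len(raw))

If those numbers come back around 10 million rather than 76,655, then the IDE is just truncating the display and the error offset at least makes sense.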
The file that fails is the 8th to be read; the other 7 and all other files that come after it read in via json.loads just fine.
Python says there is an unterminated string, so I looked at the end of the JSON in the file that fails and it appears to be fine. I've seen some mention of newlines needing to be \n in JSON, but this string is all one line. I've seen mention of \ vs \\, but in a quick look over the whole file I didn't see any backslashes at all. Other files do have \ and they read in fine. And, these files were all created by json.dumps.
I can't post the file because it still has personal info in it. Manually attempting to validate the JSON of a 76,000-char file isn't really viable.
Thoughts on how to debug this would be appreciated. In the meantime I am going to try to rebuild the files and see if this wasn't just a one-off bug, but that takes a while.
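While the rebuild runs, I'm also thinking of something like the following to find out which aggregate files fail and what the text around the reported offset looks like, rather than eyeballing them (just a sketch; it reuses f.glob and f.file_get_contents from my own helper library and assumes the ValueError message always ends with the '(char N)' part):

import json
import re
from p2p.basic import files as f

leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'

for aggFile in f.glob(leadDir + '*.agg'):
    raw = f.file_get_contents(aggFile)
    try:
        json.loads(raw)
    except ValueError as e:
        print "{} failed: {}".format(aggFile, e)
        # Pull the char offset out of the message and show the surrounding text
        m = re.search(r'\(char (\d+)\)', str(e))
        if m:
            pos = int(m.group(1))
            print repr(raw[max(0, pos - 60):pos + 60])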
- Python 2.7 via Spyder & Anaconda
- Windows 7 Pro
--- Edit --- Per request I am posting the Write Code here:
from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s
import os
import json
import copy
from datetime import datetime
import time
global leadDir
global archiveDir
global aggLeads
def aggregate_individual_lead_files():
    """
    Aggregate every individual lead file into the aggLeads global,
    keyed first by extension and then by month.
    """
    # Get the aggLeads global
    global aggLeads

    # Get all the Files with a 'lead' extension & aggregate them
    exts = [
        'lead',
        'partial',
        'inp',
        'err',
        'nobuyer',
        'prospect',
        'sent'
    ]

    for srchExt in exts:
        agg = {}
        leads = f.recursiveGlob(leadDir, '*.cd.' + srchExt)
        print "There are {} {} files to process".format(len(leads), srchExt)

        for lead in leads:
            # Get the Base Filename
            fname = f.basename(lead)
            #uniqID = st.fetchBefore('.', fname)
            #print "File: ", lead

            # Get Lead Data
            leadData = json.loads(f.file_get_contents(lead))
            agg = agg_data(leadData, agg, fname)

        aggLeads[srchExt] = copy.deepcopy(agg)

    print "Aggregate Top Lvl Keys: ", aggLeads.keys()
    print "Aggregate Next Lvl Keys: "
    for key in aggLeads:
        print "{}: ".format(key)
        for arcDate in aggLeads[key].keys():
            print "{}: {}".format(arcDate, len(aggLeads[key][arcDate]))
    # raw_input("Press Enter to continue...")
def agg_data(leadData, agg, fname=None):
    """
    File the lead under agg[<Month_Year>][<timestamp>] based on its timeStamp field.
    """
    #print "Lead: ", leadData

    # Get the timestamp of the lead
    try:
        ts = leadData['timeStamp']
        leadData.pop('timeStamp')
    except KeyError:
        return agg

    leadDate = datetime.fromtimestamp(ts)
    arcDate = leadDate.strftime("%B_%Y")
    #print "Archive Date: ", arcDate

    try:
        agg[arcDate][ts] = leadData
    except KeyError:
        agg[arcDate] = {}
        agg[arcDate][ts] = leadData
    except TypeError:
        print "Timestamp: ", ts
        print "Lead: ", leadData
        print "Archive Date: ", arcDate
        return agg

    """
    if fname is not None:
        archive_lead(fname, arcDate)
    """
    #print "File: {} added to {}".format(fname, arcDate)
    return agg
def archive_lead(fname, arcDate):
    # Archive Path
    newArcPath = archiveDir + arcDate + '//'
    if not os.path.exists(newArcPath):
        os.makedirs(newArcPath)

    # Move the file to the archive
    os.rename(leadDir + fname, newArcPath + fname)
def reformat_old_agg_data():
    """
    Fold the old-style aggregate files into the aggLeads global,
    splitting them into complete and partial leads.
    """
    # Get the aggLeads global
    global aggLeads
    aggComplete = {}
    aggPartial = {}

    oldAggFiles = f.recursiveGlob(leadDir, '*.cd.agg')
    print "There are {} old aggregate files to process".format(len(oldAggFiles))

    for agg in oldAggFiles:
        tmp = json.loads(f.file_get_contents(agg))
        for uniqId in tmp:
            leadData = tmp[uniqId]
            if leadData['isPartial'] == True:
                aggPartial = agg_data(leadData, aggPartial)
            else:
                aggComplete = agg_data(leadData, aggComplete)

    arcData = dict(aggLeads['lead'].items() + aggComplete.items())
    aggLeads['lead'] = arcData

    arcData = dict(aggLeads['partial'].items() + aggPartial.items())
    aggLeads['partial'] = arcData
def output_agg_files():
    for ext in aggLeads:
        for arcDate in aggLeads[ext]:
            arcFile = leadDir + arcDate + '.cd.' + ext + '.agg'

            if f.file_exists(arcFile):
                tmp = json.loads(f.file_get_contents(arcFile))
            else:
                tmp = {}

            arcData = dict(tmp.items() + aggLeads[ext][arcDate].items())
            f.file_put_contents(arcFile, json.dumps(arcData))
def main():
    global leadDir
    global archiveDir
    global aggLeads

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    archiveDir = leadDir + 'archive//'
    aggLeads = {}

    # Aggregate all the old individual files
    aggregate_individual_lead_files()

    # Reformat the old aggregate files
    reformat_old_agg_data()

    # Write it all out to an aggregate file
    output_agg_files()

if __name__ == "__main__":
    main()
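For what it's worth, f.file_put_contents above comes from my own helper library; the write step in output_agg_files boils down to roughly this if I use only the standard library instead (a sketch, not my actual helper code - write_agg_file is just a name for the example):

import json

def write_agg_file(arcFile, arcData):
    # Serialize and write in one shot; the with-block makes sure the handle is closed
    with open(arcFile, 'w') as fh:
        fh.write(json.dumps(arcData))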
Here is the read code:
from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s
import os
import json
import copy
from datetime import datetime
import time
global leadDir
global fields
global fieldTimes
global versions
def parse_agg_file(aggFile):
    global leadDir
    global fields
    global fieldTimes

    try:
        tmp = json.loads(f.file_get_contents(aggFile))
    except ValueError:
        print "{} failed the JSON load".format(aggFile)
        return False

    print "Opening: ", aggFile
    for ts in tmp:
        try:
            tmpTs = float(ts)
        except ValueError:
            print "Timestamp: ", ts
            continue

        leadData = tmp[ts]
        for field in leadData:
            if field not in fields:
                fields[field] = []
            fields[field].append(float(ts))
def determine_form_versions():
    global fieldTimes
    global versions

    # Determine all the fields and their start and stop times
    times = []
    for field in fields:
        minTs = min(fields[field])
        fieldTimes[field] = [minTs, max(fields[field])]
        times.append(minTs)
        print 'Min ts: {}'.format(minTs)

    times = set(sorted(times))
    print "Times: ", times
    print "Fields: ", fieldTimes

    versions = {}
    for ts in times:
        d = datetime.fromtimestamp(ts)
        ver = d.strftime("%d_%B_%Y")
        print "Version: ", ver
        versions[ver] = []

        for field in fields:
            if ts in fields[field]:
                versions[ver].append(field)
def main():
    global leadDir
    global fields
    global fieldTimes

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    fields = {}
    fieldTimes = {}

    aggFiles = f.glob(leadDir + '*.lead.agg')
    for aggFile in aggFiles:
        parse_agg_file(aggFile)

    determine_form_versions()
    print "Versions: ", versions

if __name__ == "__main__":
    main()