How to retrieve multiple JSON objects from a text file where the objects are not separated by a delimiter?

Question

I have thousands of text files containing multiple JSON objects, but unfortunately there is no delimiter between the objects.

The objects are stored as dictionaries and some of their fields are themselves objects. Each object might have a variable number of nested objects. Concretely, an object might look like this:

{field1: {}, field2: "some value", field3: {}, ...}

and hundreds of such objects are concatenated without a delimiter in a text file. This means that I can use neither json.load() nor json.loads().

Any suggestions on how I can solve this problem? Is there a known parser that does this?

Are they at least separated onto different lines, or is it just one long single-line {...}{...}{...} pileup?
– Marc B, Commented Jan 4, 2012 at 16:24
Could you add delimiters using str.replace? As in: single_line_json.replace('}{', '}\n{')
– aganders3, Commented Jan 4, 2012 at 16:40
If you need an even faster solution, you can avoid the large object list by switching to a generator: while end != s_len: obj, end = decoder.raw_decode(s, idx=end); yield obj
– tback, Commented Jan 4, 2012 at 17:54

tback · Accepted Answer · 2012-01-04 18:19:20Z

This decodes your "list" of JSON Objects from a string:

from json import JSONDecoder

def loads_invalid_obj_list(s):
    decoder = JSONDecoder()
    s_len = len(s)

    objs = []
    end = 0
    while end != s_len:
        obj, end = decoder.raw_decode(s, idx=end)
        objs.append(obj)

    return objs

The bonus here is that you work with the parser rather than around it, so it still tells you exactly where it found an error.

Examples

>>> loads_invalid_obj_list('{}{}')
[{}, {}]

>>> loads_invalid_obj_list('{}{\n}{')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "decode.py", line 9, in loads_invalid_obj_list
    obj, end = decoder.raw_decode(s, idx=end)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 2 column 2 (char 5)
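
The generator variant suggested in the comments could look roughly like the sketch below. The function name iter_concatenated_json is made up for illustration. Like the list version above, it assumes there is no whitespace between or after the objects, because raw_decode does not skip leading whitespace.

```python
from json import JSONDecoder


def iter_concatenated_json(s):
    """Yield JSON values one at a time from a string of concatenated values."""
    decoder = JSONDecoder()
    s_len = len(s)
    end = 0
    while end != s_len:
        # raw_decode returns the parsed value and the index just past it
        obj, end = decoder.raw_decode(s, idx=end)
        yield obj


if __name__ == "__main__":
    for obj in iter_concatenated_json('{"a": 1}{"b": [1, 2]}'):
        print(obj)
```

Since objects are produced lazily, only one parsed object needs to be held at a time, although the whole input string is still in memory.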

Clean Solution (added later)

import json
import re

#shameless copy paste from json/decoder.py
FLAGS = re.VERBOSE | re.MULTILINE | re.DOTALL
WHITESPACE = re.compile(r'[ \t\n\r]*', FLAGS)

class ConcatJSONDecoder(json.JSONDecoder):
    def decode(self, s, _w=WHITESPACE.match):
        s_len = len(s)

        objs = []
        end = 0
        while end != s_len:
            obj, end = self.raw_decode(s, idx=_w(s, end).end())
            end = _w(s, end).end()
            objs.append(obj)
        return objs

Examples

>>> print json.loads('{}', cls=ConcatJSONDecoder)
[{}]

>>> print json.load(open('file'), cls=ConcatJSONDecoder)
[{}]

>>> print json.loads('{}{} {', cls=ConcatJSONDecoder)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "decode.py", line 15, in decode
    obj, end = self.raw_decode(s, idx=_w(s, end).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 1 column 5 (char 5)

Really cool, I was hoping that the json module would have something like this and it has. It's perfect. Thank you!

Tadeck · Accepted Answer · 2012-01-04 17:01:39Z

4

Solution

As far as I know }{ does not appear in valid JSON, so the following should be perfectly safe when trying to get strings for separate objects that were concatenated (txt is the content of your file). It does not require any import (even of re module) to do that:

retrieved_strings = map(lambda x: '{'+x+'}', txt.strip('{}').split('}{'))

or, if you prefer list comprehensions (as David Zwicker mentioned in the comments), you can write it like this:

retrieved_strings = ['{'+x+'}' for x in txt.strip('{}').split('}{')]

The result is that retrieved_strings is a list of strings, each containing a separate JSON object. See proof here: http://ideone.com/Purpb

Example

The following string:

'{field1:"a",field2:"b"}{field1:"c",field2:"d"}{field1:"e",field2:"f"}'

will be turned into:

['{field1:"a",field2:"b"}', '{field1:"c",field2:"d"}', '{field1:"e",field2:"f"}']

as proven in the example I mentioned.

edited Jan 4, 2012 at 17:01

answered Jan 4, 2012 at 16:52

Tadeck

5 Comments

David Zwicker Over a year ago

This should be done using a list comprehension retrieved_strings = ['{'+x+'}' for x in txt.strip('{}').split('}{')]

Tadeck Over a year ago

@DavidZwicker: why? Are you one of those who consider the map() function deprecated? It is perfectly valid. However, since a list comprehension may look simpler, I will add it to my answer.

soulcheck Over a year ago

valid json with }{ : '{"f1" : "}{}{", "b" : "{{}{}}{{{}{}"}'

David Zwicker Over a year ago

@Tadeck: See stackoverflow.com/questions/1247486/… for a discussion on map vs list-comprehension. I actually use map myself sometimes, but only when the function already exists. Using lambda in conjunction with map does not make a lot of sense to me.

Tadeck Over a year ago

@soulcheck: +1, very good point! It still can be solved, but now it requires checking if the }{ sequence occurs within quotes...
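
To make soulcheck's point concrete, here is a small sketch that runs the split-based approach and the standard library's raw_decode on the counterexample from the comment above. The variable names are chosen for illustration only.

```python
import json

# soulcheck's counterexample: ONE valid object whose string values contain "}{"
txt = '{"f1" : "}{}{", "b" : "{{}{}}{{{}{}"}'

# The split-based approach cuts inside the string values,
# so it produces several fragments instead of one object.
fragments = ['{' + x + '}' for x in txt.strip('{}').split('}{')]
print(len(fragments) > 1)

# A real JSON parser tracks string boundaries and sees a single object
# that spans the whole input.
obj, end = json.JSONDecoder().raw_decode(txt)
print(end == len(txt), sorted(obj))
```

The fragments produced by the split are not valid JSON on their own, which is why the parser-based answers are more robust for data whose string values are not under your control.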

Gino Mempin · Accepted Answer · 2023-03-21 03:30:17Z

4

Sebastian Blask's answer has the right idea, but there's no reason to use regexes for such a simple change.

objs = json.loads("[%s]"%(open('your_file.name').read().replace('}{', '},{')))

Or, more legibly:

import json

with open('your_file.name') as f:
    raw_objs_string = f.read()  # read in raw data
raw_objs_string = raw_objs_string.replace('}{', '},{')  # insert a comma between each object
objs_string = '[%s]' % raw_objs_string  # wrap in a list, to make valid JSON
objs = json.loads(objs_string)  # parse JSON

edited Mar 21, 2023 at 3:30

Gino Mempin

answered Jan 4, 2012 at 17:02

Patrick Perini

Joshua · Accepted Answer · 2012-01-04 16:59:16Z

3

How about something like this:

import re
import json

jsonstr = open('test.json').read()

# Put each object on its own line (assumes the objects contain no newlines
# and no '}{' inside string values)
p = re.compile(r'}\s*{')
jsonstr = p.sub('}\n{', jsonstr)

jsonarr = jsonstr.split('\n')

for line in jsonarr:
    jsonobj = json.loads(line)
    print(json.dumps(jsonobj))

answered Jan 4, 2012 at 16:59

Joshua

Gino Mempin · Accepted Answer · 2023-03-21 03:29:11Z

3

You can load the file as a string, replace every }{ with },{ and surround the whole thing with [].

Something like:

re.sub(r'\}\s*\{', '}, {', string_read_from_a_file)

Or a simple string replace if you are sure there is never any whitespace between } and {.

If you expect }{ to occur inside strings as well, you could instead split on }{ and parse each fragment with json.loads; if that raises an error, the fragment was incomplete, so append the next fragment to it and try again, and so forth.
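
The split-and-retry idea from the last paragraph might be sketched as follows. The function name split_and_merge is hypothetical, and the sketch assumes the objects are concatenated with no whitespace between them, since it splits on the exact two-character sequence }{.

```python
import json


def split_and_merge(text):
    """Split on '}{' and re-join fragments until each chunk parses as JSON."""
    pieces = text.split('}{')
    objs = []
    buffer = None
    for i, piece in enumerate(pieces):
        # Restore the braces that split() removed at each boundary
        if i > 0:
            piece = '{' + piece
        if i < len(pieces) - 1:
            piece = piece + '}'
        buffer = piece if buffer is None else buffer + piece
        try:
            objs.append(json.loads(buffer))
        except ValueError:
            # The '}{' was inside a string value: keep accumulating
            continue
        buffer = None
    if buffer is not None:
        raise ValueError("trailing data is not a complete JSON object")
    return objs


if __name__ == "__main__":
    print(split_and_merge('{"field1" : "}{123", "field2" : "123"}{"a": 1}'))
```

Each failed attempt re-parses the accumulated buffer, so this is slower than a single pass with raw_decode when many string values contain }{.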

edited Mar 21, 2023 at 3:29

Gino Mempin

answered Jan 4, 2012 at 16:39

Sebastian Blask

2 Comments

Lejlek Over a year ago

Cool! That's clever and easy to do. I'll try it and come back with the result. Thank you!

soulcheck Over a year ago

what happens if you have '}{' string in some other places, like property values? for example: '{"field1" : "}{123", "field2" : "123"}'

redrah · Accepted Answer · 2012-01-04 16:45:07Z

1

How about reading through the file, incrementing a counter every time a { is found and decrementing it when you come across a }? When the counter reaches 0, you know you have reached the end of the first object, so send that through json.loads and start counting again. Then just repeat to completion.
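
A minimal sketch of this counting approach is shown below. The function name split_by_brace_depth is made up for illustration. One addition beyond the description above: the sketch also tracks whether it is inside a JSON string, because braces inside string values would otherwise throw the counter off.

```python
import json


def split_by_brace_depth(text):
    """Split concatenated JSON objects by tracking brace nesting depth."""
    objs = []
    depth = 0
    start = None
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Inside a string: only look for the closing quote,
            # skipping any character that follows a backslash
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == '{':
            if depth == 0:
                start = i  # a new top-level object begins here
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced '}' at index %d" % i)
            if depth == 0:
                # Counter is back at 0: one complete object
                objs.append(json.loads(text[start:i + 1]))
    if depth != 0 or in_string:
        raise ValueError("input ends inside an unfinished object")
    return objs


if __name__ == "__main__":
    print(split_by_brace_depth('{"a": {"b": "}{"}} {"c": 2}'))
```

Anything between top-level objects (whitespace, for example) is simply skipped by this sketch.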

answered Jan 4, 2012 at 16:45

redrah

Swapnil · Accepted Answer · 2015-09-22 18:53:25Z

1

import json

# Assumes each line of the file holds exactly one JSON object.
with open('filepath', 'r') as file1:
    for line in file1:
        values = json.loads(line)
        # Now you can access the object's fields using values.get('key')

answered Sep 22, 2015 at 18:53

Swapnil

1 Comment

Gino Mempin Over a year ago

This only works if the JSON objects are separated by newlines, but OP already said in a comment that the objects are in 1 single line: stackoverflow.com/questions/8730119/…

Scott Hunter · Accepted Answer · 2012-01-04 16:29:30Z

0

Suppose you added a [ to the start of the text in a file, and used a version of json.load() which, when it detected the error of finding a { instead of an expected comma (or hits the end of the file), spit out the just-completed object?

answered Jan 4, 2012 at 16:29

Scott Hunter

1 Comment

Lejlek Over a year ago

Oh, I see your point. Are you suggesting to use a try/except and then split wherever the column index points? I tried it quickly and I get the exception: "Expecting , delimiter: line 1 column 1332 (char 1332)". It's doable. I was just hoping that there was a parser out there, since it seems like something that might happen. But thanks for this suggestion.
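
The try/except idea discussed here could be sketched as below. This is a variation rather than exactly what was proposed: instead of prepending a [, it parses the text as-is and relies on the "Extra data" error that json.loads raises when something follows the first complete value. It assumes Python 3.5 or later, where json.JSONDecodeError exposes msg and pos attributes. The function name split_on_extra_data is hypothetical.

```python
import json


def split_on_extra_data(text):
    """Peel objects off the front of text using the parser's error position."""
    objs = []
    rest = text.strip()
    while rest:
        try:
            # Succeeds only when exactly one value is left
            objs.append(json.loads(rest))
            break
        except json.JSONDecodeError as err:
            if err.msg != "Extra data":
                raise  # a genuine syntax error, not concatenation
            # err.pos is the index where the unexpected extra data starts
            objs.append(json.loads(rest[:err.pos]))
            rest = rest[err.pos:]
    return objs


if __name__ == "__main__":
    print(split_on_extra_data('{"a": 1}{"b": {"c": "}{"}} {"d": 2}'))
```

Every object is parsed twice and the remaining text is re-scanned on each pass, so for large files the raw_decode approach from the accepted answer does the same job with less work.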

Spencer · Accepted Answer · 2012-01-04 17:02:29Z

0

Fix a file with that junk in it, in place:

$ sed -i -e 's;}{;}, {;g' foo

Do it on the fly in Python:

junkJson.replace('}{', '}, {')

answered Jan 4, 2012 at 17:02

Spencer
