
I am very new to JSON files. If I have a JSON file with multiple JSON objects, such as the following:

{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
 "Code":[{"event1":"A","result":"1"},…]}
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
 "Code":[{"event1":"B","result":"1"},…]}
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
 "Code":[{"event1":"B","result":"0"},…]}
…

I want to extract all the "Timestamp" and "Usefulness" values into a data frame:

    Timestamp    Usefulness
 0   20140101      Yes
 1   20140102      No
 2   20140103      No
 …

Does anyone know a general way to deal with such problems?


6 Answers


Update: I wrote a solution that does not require reading the entire file in one go. It is too big for a Stack Overflow answer, but can be found here: jsonstream.

You can use json.JSONDecoder.raw_decode to decode arbitrarily big strings of "stacked" JSON (so long as they fit in memory). raw_decode stops once it has a valid object and returns the index of the first position that was not part of the parsed object. It is poorly documented [1] (see footer), but you can pass this index back to raw_decode and it starts parsing again from that position. Unfortunately, raw_decode does not accept strings that have leading whitespace, so we need to search for the first non-whitespace position in the document.

from json import JSONDecoder, JSONDecodeError
import re

NOT_WHITESPACE = re.compile(r'\S')

def decode_stacked(document, idx=0, decoder=JSONDecoder()):
    while True:
        match = NOT_WHITESPACE.search(document, idx)
        if not match:
            return
        idx = match.start()
        
        try:
            obj, idx = decoder.raw_decode(document, idx)
        except JSONDecodeError:
            # do something sensible if there's some error
            raise
        yield obj

s = """

{"a": 1}  


   [
1
,   
2
]


"""

for obj in decode_stacked(s):
    print(obj)

prints:

{'a': 1}
[1, 2]
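
To tie this back to the question: a minimal sketch (assuming pandas is installed, and that the stacked objects live in a file named file.json, a name I'm making up) that collects the two fields into a data frame:

import pandas as pd

with open('file.json') as f:
    document = f.read()

df = pd.DataFrame(
    [{"Timestamp": obj["Timestamp"], "Usefulness": obj["Usefulness"]}
     for obj in decode_stacked(document)]
)
print(df)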

Note About Missing Documentation

The current signature of raw_decode() dates from 2009, when simplejson was ported into the standard library. The documentation for raw_decode() in simplejson mentions an optional idx argument that can be used to start parsing at an offset. Given that the signature of raw_decode() has not changed since 2009, I think it is fair to assume the API is fairly stable, especially as decode() itself uses the idx argument of raw_decode() to skip leading whitespace when parsing a string, which is exactly what this answer uses it for too. The documentation of raw_decode() in simplejson is:

raw_decode(s[, idx=0])

Decode a JSON document from s (a str or unicode beginning with a JSON document) starting from the index idx and return a 2-tuple of the Python representation and the index in s where the document ended.

This can be used to decode a JSON document from a string that may have extraneous data at the end, or to decode a string that has a series of JSON objects.

JSONDecodeError will be raised if the given JSON document is not valid.
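
A quick illustration of that documented behaviour, plus the idx argument this answer relies on:

from json import JSONDecoder

decoder = JSONDecoder()

# extraneous data after the document is left alone
print(decoder.raw_decode('{"a": 1} trailing garbage'))  # ({'a': 1}, 8)

# idx starts parsing at an offset, e.g. past leading whitespace
print(decoder.raw_decode('   {"a": 1}', 3))  # ({'a': 1}, 11)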


7 Comments

I, too, like this answer quite a bit, except for a couple of things: it requires reading the entire file into memory, and it uses undocumented features of the JSONDecoder.
This works for AWS Lambda if the file is a single-line, multi-JSON file. Can you explain in more detail how this works? I'm not able to understand raw_decode or how it can tell where a valid JSON document starts or ends.
@AbilashAmarasekaran did you check the docs for raw_decode? It slurps up one JSON document from the string, leaving the rest untouched. The loop here skips leading whitespace after the last document and prepares the string for the next raw_decode call using the undocumented idx argument as the offset. You could use slicing instead, as in this answer, which might be a bit slower but uses a fully documented API.
Thanks @ggorlen, I kind of guessed that is what it must be doing.
This does not have to use undocumented features of the JSONDecoder. You can slice the string yourself and keep a running offset: obj, pos = decoder.raw_decode(document[pos_total:]) and then pos_total += pos, so that each decode starts at position pos_total in the string without ever passing the undocumented second argument.
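
For reference, a minimal sketch of the slicing variant described in the last comment, which sticks to documented behaviour at the cost of repeated string slicing:

from json import JSONDecoder

def decode_stacked_sliced(document):
    decoder = JSONDecoder()
    rest = document.lstrip()
    while rest:
        # raw_decode tolerates extraneous data after the first document
        obj, pos = decoder.raw_decode(rest)
        yield obj
        # drop the parsed document and any whitespace before the next one
        rest = rest[pos:].lstrip()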

Use a JSON array, in the format:

[
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
  "Code":[{"event1":"A","result":"1"},…]},
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
  "Code":[{"event1":"B","result":"1"},…]},
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
  "Code":[{"event1":"B","result":"0"},…]},
...
]

Then load it in your Python code:

import json

with open('file.json') as json_file:
    data = json.load(json_file)

Now the content of data is a list of dictionaries, one for each of the elements.

You can access it easily, e.g.:

data[0]["ID"]

3 Comments

This is cool, but it prevents you from using the file as an endless stream (e.g. log-like append-only file data) and consumes a lot more memory.
@exa, this is true, but if you need append-only logging for this data stream, perhaps you should be looking at a format other than JSON to transfer your information, as JSON requires the closing bracket for all data structures, implying a non-infinite non-stream format.
This doesn't really answer the question generally. OP is asking about how to process a stream of JSON objects, not bracketed, comma-delimited JSON. A valid JSON stream might look like {"a": {"b": 42}}{"c": 3} which this doesn't help with parsing.

So, as was mentioned in a couple of comments, containing the data in an array is simpler, but the solution does not scale well as the data set grows. You really should only use an iterable object when you want to access a random item in the array; otherwise, generators are the way to go. Below I have prototyped a reader function which reads each JSON object individually and returns a generator.

The basic idea is to have the reader split on the newline character "\n" (or "\r\n" on Windows). Python does this for you when you iterate over a file object line by line.

import json
def json_reader(filename):
    with open(filename) as f:
        for line in f:
            yield json.loads(line)

However, this method only really works when the file is written as you have it, with each object separated by a newline character. Below is an example of a writer that takes an array of JSON objects and saves each one on a new line.

def json_writer(file, json_objects):
    with open(file, "w") as f:
        for jsonobj in json_objects:
            jsonstr = json.dumps(jsonobj)
            f.write(jsonstr + "\n")

You could also do the same operation with file.writelines() and a list comprehension:

...
    json_strs = [json.dumps(j) + "\n" for j in json_objects]
    f.writelines(json_strs)
...

And if you wanted to append the data instead of writing a new file just change open(file, "w") to open(file, "a").

In the end, I find this helps a great deal, not only with readability when I open JSON files in a text editor, but also in terms of using memory more efficiently.

On that note, if you change your mind at some point and want a list out of the reader, Python lets you pass a generator to the list() constructor and will populate the list automatically. In other words, just write:

lst = list(json_reader(file))
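
Putting the reader and writer together, a small round-trip sketch (records.json is just an example name):

objects = [
    {"ID": "12345", "Timestamp": "20140101", "Usefulness": "Yes"},
    {"ID": "1A35B", "Timestamp": "20140102", "Usefulness": "No"},
]

json_writer("records.json", objects)

for obj in json_reader("records.json"):
    print(obj["Timestamp"], obj["Usefulness"])
# 20140101 Yes
# 20140102 No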

3 Comments

What does "You really should only use an iterator when you want to access a random object in the array" mean? Did you mean "list" instead of "iterator"?
@Clément I meant Iterable. That's my bad.
Iterable doesn't provide random access, AFAIK

Added streaming support, based on @dunes's answer:

import re
from json import JSONDecoder, JSONDecodeError

NOT_WHITESPACE = re.compile(r"[^\s]")


def stream_json(file_obj, buf_size=1024, decoder=JSONDecoder()):
    buf = ""
    ex = None
    while True:
        block = file_obj.read(buf_size)
        if not block:
            break
        buf += block
        pos = 0
        while True:
            # skip whitespace between documents
            match = NOT_WHITESPACE.search(buf, pos)
            if not match:
                break
            pos = match.start()
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except JSONDecodeError as e:
                # incomplete document at the end of the buffer; read more
                ex = e
                break
            else:
                ex = None
                yield obj
        # keep only the unparsed tail of the buffer
        buf = buf[pos:]
    if ex is not None:
        raise ex
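
A self-contained way to try it, using io.StringIO in place of a real file object and a deliberately tiny buffer to exercise the incremental path:

import io

src = io.StringIO('{"a": 1} {"b": 2}\n[1, 2]')
for obj in stream_json(src, buf_size=4):
    print(obj)
# {'a': 1}
# {'b': 2}
# [1, 2]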

2 Comments

This is great, thanks! If you are processing large data files, crank up the block size (about 4 MB benchmarked the fastest for me on files from 10 MB to 2 GB); otherwise you get a lot of spurious exceptions from raw_decode, which slows it way down.
You can use it like:

log = []
with open(log_file, 'r') as f:
    for record in stream_json(f):
        log.append(record)

This is what I do. It assumes there will be a newline between each object, but allows each object to span multiple lines.

import json

def json_reader(filename):
    with open(filename) as f:
        text = ""
        error = None
        for line in f:
            text += line
            try:
                yield json.loads(text)
                text = ""
                error = None
            except json.JSONDecodeError as e:
                # keep accumulating lines until the text parses
                error = e
        if error is not None:
            raise error

It isn't super efficient, since it attempts to parse the partial JSON text multiple times, but it is often better than loading the entire file into memory, and it avoids adding another dependency.
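
A quick demonstration with an object spanning two lines (sample.json is just an example name):

with open("sample.json", "w") as f:
    f.write('{"ID": "12345",\n "Usefulness": "Yes"}\n')
    f.write('{"ID": "1A35B", "Usefulness": "No"}\n')

for obj in json_reader("sample.json"):
    print(obj["ID"], obj["Usefulness"])
# 12345 Yes
# 1A35B No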



The exception handling already lives inside the JSONDecoder, so maybe we can reuse its approach: skip whitespace the way decode() does, then let raw_decode consume one document at a time.

import json
import re
import sys

def json_iter(text: str):
    # the same whitespace matcher the stdlib decoder uses internally
    whitespace = re.compile(r'[ \t\n\r]*', re.VERBOSE | re.MULTILINE | re.DOTALL).match
    decoder = json.JSONDecoder()
    end = 0
    while True:
        # skip whitespace between documents
        end = whitespace(text, end).end()
        if end == len(text):
            return
        obj, end = decoder.raw_decode(text, idx=end)
        yield obj


for idx, obj in enumerate(json_iter("[1][2][3][true][false]true false 1 2 3 4 5 6 7 8 9 10 {}")):
    sys.stdout.write(f"[{idx}] => {obj}\n")

prints:
[0] => [1]
[1] => [2]
[2] => [3]
[3] => [True]
[4] => [False]
[5] => True
[6] => False
[7] => 1
[8] => 2
[9] => 3
[10] => 4
[11] => 5
[12] => 6
[13] => 7
[14] => 8
[15] => 9
[16] => 10
[17] => {}

