150

I have a file which stores many JavaScript objects in JSON form and I need to read the file, create each of the objects, and do something with them (insert them into a db in my case). The JavaScript objects can be represented in one of the following formats:

Format A:

[{name: 'thing1'},
....
{name: 'thing999999999'}]

or Format B:

{name: 'thing1'}         // <== My choice.
...
{name: 'thing999999999'}

Note that the ... indicates a lot of JSON objects. I am aware I could read the entire file into memory and then use JSON.parse() like this:

fs.readFile(filePath, 'utf-8', function (err, fileContents) {
  if (err) throw err;
  console.log(JSON.parse(fileContents));
});

However, the file could be really large, so I would prefer to use a stream to accomplish this. The problem I see with a stream is that the file contents could be broken into data chunks at any point, so how can I use JSON.parse() on such objects?

Ideally, each object would be read as a separate data chunk, but I am not sure how to do that.

var importStream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
importStream.on('data', function(chunk) {

    var pleaseBeAJSObject = JSON.parse(chunk);           
    // insert pleaseBeAJSObject in a database
});
importStream.on('end', function(item) {
   console.log("Woot, imported objects into the database!");
});

Note, I wish to prevent reading the entire file into memory. Time efficiency does not matter to me. Yes, I could try to read a number of objects at once and insert them all at once, but that's a performance tweak - I need a way that is guaranteed not to cause a memory overload, no matter how many objects are contained in the file.

I can choose to use Format A or Format B or maybe something else, just please specify in your answer. Thanks!

1
  • For format B you could parse through the chunk for new lines, and extract each whole line, concatenating the rest if it cuts off in the middle. There may be a more elegant way though. I haven't worked with streams too much. Commented Aug 8, 2012 at 22:39

14 Answers

107

To process a file line-by-line, you simply need to decouple the reading of the file and the code that acts upon that input. You can accomplish this by buffering your input until you hit a newline. Assuming we have one JSON object per line (basically, format B):

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var buf = '';

stream.on('data', function(d) {
    buf += d.toString(); // when data is read, stash it in a string buffer
    pump(); // then process the buffer
});

function pump() {
    var pos;

    while ((pos = buf.indexOf('\n')) >= 0) { // keep going while there's a newline somewhere in the buffer
        if (pos == 0) { // if there's more than one newline in a row, the buffer will now start with a newline
            buf = buf.slice(1); // discard it
            continue; // so that the next iteration will start with data
        }
        processLine(buf.slice(0,pos)); // hand off the line
        buf = buf.slice(pos+1); // and slice the processed data off the buffer
    }
}

function processLine(line) { // here's where we do something with a line

    if (line[line.length-1] == '\r') line=line.substr(0,line.length-1); // discard CR (0x0D)

    if (line.length > 0) { // ignore empty lines
        var obj = JSON.parse(line); // parse the JSON
        console.log(obj); // do something with the data here!
    }
}

Each time the file stream receives data from the file system, it's stashed in a buffer, and then pump is called.

If there's no newline in the buffer, pump simply returns without doing anything. More data (and potentially a newline) will be added to the buffer the next time the stream gets data, and then we'll have a complete object.

If there is a newline, pump slices the buffer from the beginning up to the newline and hands it off to processLine. It then checks again whether there's another newline in the buffer (the while loop). In this way, we can process all of the lines that were read in the current chunk.

Finally, processLine is called once per input line. If present, it strips off the carriage return character (to avoid issues with line endings – LF vs CRLF), and then calls JSON.parse on the line. At this point, you can do whatever you need to with your object.

Note that JSON.parse is strict about what it accepts as input; you must quote your identifiers and string values with double quotes. In other words, {name:'thing1'} will throw an error; you must use {"name":"thing1"}.

Because no more than a chunk of data will ever be in memory at a time, this will be extremely memory efficient. It will also be extremely fast. A quick test showed I processed 10,000 rows in under 15ms.
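
For reference, the same newline-buffering idea can also be packaged as a Transform stream so it can be piped like any other stream. This is a minimal sketch of that idea in more modern JavaScript (it assumes filePath is defined as in the question):

const fs = require('fs');
const { Transform } = require('stream');

class LineSplitter extends Transform {
  constructor() {
    super({ readableObjectMode: true }); // emit parsed objects downstream
    this.buf = '';
  }

  _transform(chunk, encoding, callback) {
    this.buf += chunk.toString();
    let pos;
    while ((pos = this.buf.indexOf('\n')) >= 0) {
      const line = this.buf.slice(0, pos).replace(/\r$/, ''); // strip CR for CRLF files
      this.buf = this.buf.slice(pos + 1);
      if (line.length > 0) this.push(JSON.parse(line)); // one JSON object per line
    }
    callback();
  }

  _flush(callback) {
    // handle a final line that has no trailing newline
    if (this.buf.trim().length > 0) this.push(JSON.parse(this.buf));
    callback();
  }
}

fs.createReadStream(filePath, { encoding: 'utf-8' })
  .pipe(new LineSplitter())
  .on('data', function (obj) {
    // insert obj into the database here
  });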


9 Comments

This answer is now redundant. Use JSONStream, and you have out of the box support.
The function name 'process' is bad. 'process' should be a system variable. This bug confused me for hours.
@arcseldon I don't think the fact that there's a library that does this makes this answer redundant. It's certainly still useful to know how this can be done without the module.
I am not sure if this would work for a minified json file. What if the whole file was wrapped up in a single line, and using any such delimiters wasn't possible? How do we solve this problem then?
Third party libraries are not made of magic you know. They are just like this answer, elaborated versions of hand-rolled solutions, but just packed and labeled as a program. Understanding how things work is much more important and relevant than blindly throwing data into a library expecting results. Just saying :)
48

Just as I was thinking that it would be fun to write a streaming JSON parser, I also thought that maybe I should do a quick search to see if there's one already available.

Turns out there is.

Since I just found it, I've obviously not used it, so I can't comment on its quality, but I'll be interested to hear if it works.

It does work. Consider the following JavaScript, which uses _.isString (from lodash or underscore):

var fs = require('fs'), JSONStream = require('JSONStream'), _ = require('lodash'); // or underscore
var stream = fs.createReadStream(filePath, { encoding: 'utf8' });

stream.pipe(JSONStream.parse('*'))
  .on('data', (d) => {
    console.log(typeof d);
    console.log("isString: " + _.isString(d));
  });

This will log objects as they come in if the stream is an array of objects. Therefore the only thing being buffered is one object at a time.

2 Comments

Just the thing I was searching for since 2 days! Thanks a lot :)
@AtharvaKulkarni: JSONstream hasn't been maintained since 2018. You may want to evaluate stream-json or @streamparser/json.
42

As of October 2014, you can just do something like the following (using JSONStream) - https://www.npmjs.org/package/JSONStream

var fs = require('fs'),
    JSONStream = require('JSONStream');

var getStream = function () {
    var jsonData = 'myData.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
};

getStream().pipe(MyTransformToDoWhateverProcessingAsNeeded).on('error', function (err) {
    // handle any errors
});

To demonstrate with a working example:

npm install JSONStream event-stream

data.json:

{
  "greeting": "hello world"
}

hello.js:

var fs = require('fs'),
    JSONStream = require('JSONStream'),
    es = require('event-stream');

var getStream = function () {
    var jsonData = 'data.json',
        stream = fs.createReadStream(jsonData, { encoding: 'utf8' }),
        parser = JSONStream.parse('*');
    return stream.pipe(parser);
};

getStream()
    .pipe(es.mapSync(function (data) {
        console.log(data);
    }));
$ node hello.js
// hello world

5 Comments

This is mostly true and useful, but I think you need to do parse('*') or you won't get any data.
@JohnZwinck Thank you, have updated the answer, and added a working example to demonstrate it fully.
in the first code block, the first set of parentheses var getStream() = function () { should be removed.
This failed with an out of memory error with a 500mb json file.
30

I had a similar requirement: I needed to read a large JSON file in Node.js, process the data in chunks, call an API, and save the results in MongoDB. inputFile.json looks like:

{
 "customers":[
       { /*customer data*/},
       { /*customer data*/},
       { /*customer data*/}....
      ]
}

I used JSONStream and event-stream to process the records sequentially.

var fs = require("fs");
var JSONStream = require("JSONStream");
var es = require("event-stream");

var fileStream = fs.createReadStream(filePath, { encoding: "utf8" });
fileStream.pipe(JSONStream.parse("customers.*")).pipe(
  es.through(function(data) {
    console.log("printing one customer object read from file ::");
    console.log(data);
    this.pause();
    processOneCustomer(data, this);
    return data;
  },
  function end() {
    console.log("stream reading ended");
    this.emit("end");
  })
);

function processOneCustomer(data, stream) {
  // DataModel.save is the app's own persistence call; resume once it completes
  DataModel.save(function(err, dataModel) {
    stream.resume();
  });
}

4 Comments

Thank you so much for adding your answer; my case also needed some synchronous handling. However, after testing, it was not possible for me to call end() as a callback after the pipe is finished. I believe the only thing that can be done is adding an event for what should happen after the stream is 'finished' / 'close', with fileStream.on('close', ...).
Hey - this was a great solution BUT there's a typo in your code. You have a closing parenthesis BEFORE function end() - but you need to move it afterward - otherwise end() is not included in the es.through().
This is great, now I just need to figure how to make it not stop at the first one sigh
28

I realize that you want to avoid reading the whole JSON file into memory if possible, however if you have the memory available it may not be a bad idea performance-wise. Using node.js's require() on a json file loads the data into memory really fast.

I ran two tests to see what the performance looked like on printing out an attribute from each feature from an 81MB geojson file.

In the first test, I read the entire geojson file into memory using var data = require('./geo.json'). That took 3330 milliseconds, and then printing out an attribute from each feature took 804 milliseconds, for a grand total of 4134 milliseconds. However, it appeared that node.js was using 411MB of memory.

In the second test, I used @arcseldon's answer with JSONStream + event-stream. I modified the JSONPath query to select only what I needed. This time the memory never went higher than 82MB, however, the whole thing now took 70 seconds to complete!
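
For reference, that kind of narrowing with JSONStream looks roughly like this. This is only a sketch assuming a standard GeoJSON FeatureCollection; the exact path and attribute name used in the test above aren't given, so NAME is a hypothetical property:

var fs = require('fs'),
    JSONStream = require('JSONStream'),
    es = require('event-stream');

fs.createReadStream('./geo.json', { encoding: 'utf8' })
    .pipe(JSONStream.parse('features.*.properties')) // emit only each feature's properties object
    .pipe(es.mapSync(function (properties) {
        console.log(properties.NAME); // print one attribute per feature
    }));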


12

I wrote a module that can do this, called BFJ. Specifically, the method bfj.match can be used to break up a large stream into discrete chunks of JSON:

const bfj = require('bfj');
const fs = require('fs');

const stream = fs.createReadStream(filePath);

bfj.match(stream, (key, value, depth) => depth === 0, { ndjson: true })
  .on('data', object => {
    // do whatever you need to do with object
  })
  .on('dataError', error => {
    // a syntax error was found in the JSON
  })
  .on('error', error => {
    // some kind of operational error occurred
  })
  .on('end', error => {
    // finished processing the stream
  });

Here, bfj.match returns a readable, object-mode stream that will receive the parsed data items. It is passed 3 arguments:

  1. A readable stream containing the input JSON.

  2. A predicate that indicates which items from the parsed JSON will be pushed to the result stream.

  3. An options object indicating that the input is newline-delimited JSON (this is to process format B from the question, it's not required for format A).

Upon being called, bfj.match will parse JSON from the input stream depth-first, calling the predicate with each value to determine whether or not to push that item to the result stream. The predicate is passed three arguments:

  1. The property key or array index (this will be undefined for top-level items).

  2. The value itself.

  3. The depth of the item in the JSON structure (zero for top-level items).

Of course a more complex predicate can also be used as necessary according to requirements. You can also pass a string or a regular expression instead of a predicate function, if you want to perform simple matches against property keys.
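
For example, the string selector form described above might be used like this (a sketch based on that description; filePath is assumed to be defined as in the question):

const bfj = require('bfj');
const fs = require('fs');

// Push every value stored under a "name" key to the result stream,
// regardless of how deeply it is nested.
bfj.match(fs.createReadStream(filePath), 'name')
  .on('data', value => {
    console.log(value); // e.g. 'thing1', 'thing2', ...
  })
  .on('end', () => {
    console.log('finished processing the stream');
  });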


5

If you have control over the input file, and it's an array of objects, you can solve this more easily. Arrange to output the file with each record on one line, like this:

[
   {"key": value},
   {"key": value},
   ...
]

This is still valid JSON.

Then, use the node.js readline module to process them one line at a time.

var fs = require("fs");

var lineReader = require('readline').createInterface({
    input: fs.createReadStream("input.txt")
});

lineReader.on('line', function (line) {
    line = line.trim();

    if (line.charAt(line.length-1) === ',') {
        line = line.substr(0, line.length-1);
    }

    if (line.charAt(0) === '{') {
        processRecord(JSON.parse(line));
    }
});

function processRecord(record) {
    // Process the records one at a time here! 
}


4

I solved this problem using the split npm module. Pipe your stream into split, and it will "Break up a stream and reassemble it so that each line is a chunk".

Sample code:

var fs = require('fs')
  , split = require('split')
  ;

var stream = fs.createReadStream(filePath, {flags: 'r', encoding: 'utf-8'});
var lineStream = stream.pipe(split());
lineStream.on('data', function(chunk) {
    if (chunk) { // skip empty lines (e.g. a trailing newline)
        var json = JSON.parse(chunk);
        // ...
    }
});


1

I had around 2 GB of JSON containing 25 million records in the same format. I wrote the code below because none of the libraries I tried (JSONStream and others) was efficient enough; they regularly hit the max heap error in Node.

Use this code.

const fs = require('fs');
const readline = require('readline');

const CHUNK_SIZE = 500;
let chunks = [];

// Create a read stream for the JSON file
const readStream = fs.createReadStream('data.json', 'utf8');

const rl = readline.createInterface({
  input: readStream,
  crlfDelay: Infinity
});

// This handler runs whenever readline extracts a full line from the stream
rl.on('line', (line) => {
  if (line) {
    chunks.push(line);
    if (chunks.length >= CHUNK_SIZE) {
      // Once we have CHUNK_SIZE lines, pause the stream.
      // This triggers the 'pause' handler below.
      rl.pause();
    }
  }
});

rl.on('pause', async () => {
  console.log('Readline paused.');
  await this.processJSONChunk(chunks);
  chunks.length = 0;
  setTimeout(() => rl.resume(), 5 * 1000);
});

rl.on('resume', () => {
  console.log("==resumed===");
});

rl.on('close', async () => {
  await this.processJSONChunk(chunks);
  console.log("==ENDED===");
});

In this code we handle everything with events, because we don't want to load all the data into memory at once; this also prevents the heap error.
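
For comparison, on newer Node.js versions the manual pause/resume bookkeeping can be avoided by iterating the readline interface with for await...of, which only pulls the next line once the previous await has resolved. A minimal sketch of that variant (processJSONChunk stands for the same chunk-processing routine used above):

const fs = require('fs');
const readline = require('readline');

async function importFile(filePath) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath, 'utf8'),
    crlfDelay: Infinity,
  });

  const CHUNK_SIZE = 500;
  let chunk = [];

  // The loop only asks for the next line after the previous await resolves,
  // so no manual pause()/resume() is needed.
  for await (const line of rl) {
    if (!line) continue;
    chunk.push(line);
    if (chunk.length >= CHUNK_SIZE) {
      await processJSONChunk(chunk);
      chunk = [];
    }
  }

  if (chunk.length > 0) {
    await processJSONChunk(chunk); // flush the remainder
  }
}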


1

The Wikipedia article about JSON streaming contains an overview over different ways how JSON values can be streamed. The most common formats are:

  • JSON Lines (also called JSONL, NDJSON or LDJSON): a stream of JSON documents separated by newlines
  • JSON-seq: a stream of JSON documents separated by \x1E record separator characters
  • JSON: a JSON document that is or contains an array/object whose elements/properties are streamed.

In your case, it seems like your data consists of a large number of JSON documents that are small in size, plus you have control over the format. In that particular case, I think using JSONL or JSON-seq is the best approach, since splitting up the stream into individual JSON values does not require parsing the actual JSON and you can use the native JSON.parse() on the individual values, making this approach more performant than any streaming JSON parser implemented in JavaScript. Here is an example way to consume JSONL:

import { createReadStream } from "node:fs";
import { Readable } from "node:stream";

class JsonLinesParser extends TransformStream<string, any> {
    protected _buffer: string = "";

    constructor() {
        super({
            transform: (chunk, controller) => {
                let index: number;
                let rest = chunk;
                while ((index = rest.indexOf("\n")) !== -1) {
                    controller.enqueue(JSON.parse(`${this._buffer}${rest.slice(0, index + 1)}`));
                    rest = rest.slice(index + 1);
                    this._buffer = "";
                }

                if (rest.length > 0) {
                    this._buffer += rest;
                }
            },
            flush: (controller) => {
                if (this._buffer.length > 0) {
                    controller.enqueue(JSON.parse(this._buffer));
                }
            }
        });
    }
}

for await (const value of Readable.toWeb(createReadStream(filePath, "utf-8")).pipeThrough(new JsonLinesParser())) {
    console.log(value);
}
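
If your input is JSON-seq rather than JSONL, the same approach works by splitting on the \x1E record separator instead of newlines. A minimal sketch in plain JavaScript, following the same pattern as the class above (filePath as elsewhere; TransformStream is global in Node 18+, or can be imported from "node:stream/web" on older versions):

import { createReadStream } from "node:fs";
import { Readable } from "node:stream";

function jsonSeqParser() {
    let buffer = "";
    return new TransformStream({
        transform(chunk, controller) {
            // Accumulate input and split on the \x1E record separator
            const parts = (buffer + chunk).split("\x1e");
            buffer = parts.pop();
            for (const part of parts) {
                if (part.trim().length > 0) {
                    controller.enqueue(JSON.parse(part));
                }
            }
        },
        flush(controller) {
            if (buffer.trim().length > 0) {
                controller.enqueue(JSON.parse(buffer));
            }
        }
    });
}

for await (const value of Readable.toWeb(createReadStream(filePath, "utf-8")).pipeThrough(jsonSeqParser())) {
    console.log(value);
}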

If you don’t have control over the format, there are some scenarios where it is not possible to simply split the stream and then parse the resulting chunks in a non-streaming way (like in the example above), but the JSON itself needs to be parsed in a streaming way. Such scenarios might be:

  • Concatenated JSON: The JSON documents follow each other without any separator.
  • Prettified JSONL: The JSON documents are separated by newlines, but they can also contain newlines themselves.
  • Large JSON documents: The individual JSON documents are so large that you want to parse them in a streaming way.

There are various libraries for parsing JSON in a streaming way, for example json-stream-es. Here would be an example how to consume the individual entries of one large JSON array:

import { createReadStream } from "node:fs";
import { Readable } from "node:stream";
import { parseJsonStream } from "json-stream-es";

const stream = Readable.toWeb(createReadStream(filePath, "utf-8"))
    .pipeThrough(parseJsonStream([]));

for await (const value of stream) {
    console.log(value);
}

This example would consume a file containing concatenated JSON or prettified JSONL (it would also work for regular JSONL, but is less performant than the solution above):

import { createReadStream } from "node:fs";
import { Readable } from "node:stream";
import { parseJsonStream } from "json-stream-es";

const stream = Readable.toWeb(createReadStream(filePath, "utf-8"))
    .pipeThrough(parseJsonStream(undefined, { multi: true }));

for await (const value of stream) {
    console.log(value);
}


0

Using the @josh3736 answer, but for ES2021 and Node.js 16+ with async/await + AirBnb rules:

import fs from 'node:fs';

const file = 'file.json';

/**
 * @callback itemProcessorCb
 * @param {object} item The current item
 */

/**
 * Process each data chunk in a stream.
 *
 * @param {import('fs').ReadStream} readable The readable stream
 * @param {itemProcessorCb} itemProcessor A function to process each item
 */
async function processChunk(readable, itemProcessor) {
  let data = '';
  let total = 0;

  // eslint-disable-next-line no-restricted-syntax
  for await (const chunk of readable) {
    // join with last result, remove CR and get lines
    const lines = (data + chunk).replace(/\r/g, '').split('\n');

    // clear last result
    data = '';

    // process lines
    let line = lines.shift();
    const items = [];

    while (line) {
      // check if isn't a empty line or an array definition
      if (line !== '' && !/[\[\]]+/.test(line)) {
        try {
          // remove the last comma and parse json
          const json = JSON.parse(line.replace(/\s?(,)+\s?$/, ''));
          items.push(json);
        } catch (error) {
          // last line gets only a partial line from chunk
          // so we add this to join at next loop
          data += line;
        }
      }

      // continue
      line = lines.shift();
    }

    total += items.length;

    // Process items in parallel
    await Promise.all(items.map(itemProcessor));
  }

  console.log(`${total} items processed.`);
}

// Process each item
async function processItem(item) {
  console.log(item);
}

// Init
try {
  const readable = fs.createReadStream(file, {
    flags: 'r',
    encoding: 'utf-8',
  });

  processChunk(readable, processItem);
} catch (error) {
  console.error(error.message);
}

For a JSON like:

[
  { "name": "A", "active": true },
  { "name": "B", "active": false },
  ...
]


0

First, convert the .json to .ndjson using jq:

jq -c ".[]" file.json > file.ndjson

Now you can use the functions below to read/write large NDJSON files.

const fs = require('fs/promises')
const fsSync = require('fs')
const readline = require('node:readline')
const stream = require("stream")
const { pipeline } = require('node:stream/promises')

async function readNDJSON(pathToNDJSON){
    let arr = []
    const rl = readline.createInterface({
        input: fsSync.createReadStream(pathToNDJSON),
        crlfDelay: Infinity,
      })
    for await (const line of rl) 
        arr.push(JSON.parse(line))
    return arr
}

// the json should be in records format i.e [{column1:value,column2:value},{column1:value,column2:value},...]
async function saveNDJSON(pathToSave, json){
    await pipeline(
        stream.Readable.from(json.map(e=>JSON.stringify(e)+'\n')),
        fsSync.createWriteStream(pathToSave),
    )
}

To convert .ndjson to .json you can use:

jq --slurp '.' file.ndjson > file.json


-2

var https = require('https');

https.get(url1, function(response) {
  var data = ""; 
  response.on('data', function(chunk) {
    data += chunk.toString(); 
  }) 
  .on('end', function() {
    console.log(data)
  });
});

2 Comments

Please edit your answer and describe how this code resolves the problem of parsing a large JSON file.
Please read How do I write a good answer?. While this code block may answer the OP's question, this answer would be much more useful if you explain how this code is different from the code in the question, what you've changed, why you've changed it and why that solves the problem without introducing others.
-7

I think you need to use a database. MongoDB is a good choice in this case because it is JSON compatible.

UPDATE: You can use the mongoimport tool to import JSON data into MongoDB.

mongoimport --collection collection --file collection.json

2 Comments

This doesn't answer the question. Note that the second line of the question says he wants to do this to get data into a database.
mongoimport only imports files up to 16MB in size.
