I generate a ~200,000-element array of objects (using object literal notation inside map rather than new Constructor()) and save a JSON.stringify'd version of it to disk, where it takes up 31 MB, including newlines and one space per indentation level (JSON.stringify(arr, null, 1)).
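For concreteness, the generation-and-save step looks roughly like this (the fields below are made-up stand-ins; my real objects come from JMdict entries):

var fs = require('fs');
// Stand-in for the real data: ~200,000 object literals built inside map, then
// serialized with newlines and one-space indentation, like the real file.
var ids = [];
for (var i = 0; i < 200000; i++) { ids.push(i); }
var arr = ids.map(function (i) {
  return {kanji: 'k' + i, kana: 'r' + i, senses: ['gloss ' + i]}; // hypothetical fields
});
fs.writeFileSync('JMdict-all.json', JSON.stringify(arr, null, 1));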
Then, in a new node process, I read the entire file into a UTF-8 string and pass it to JSON.parse:
var fs = require('fs');
var arr1 = JSON.parse(fs.readFileSync('JMdict-all.json', {encoding : 'utf8'}));
Node memory usage is about 1.05 GB according to Mavericks' Activity Monitor! Even typing into a Terminal feels laggier on my ancient 4 GB RAM machine.
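(Activity Monitor measures the whole process; for a number from inside node itself, process.memoryUsage() reports the resident set size and V8 heap usage, e.g.:)

// Rough in-process check to go alongside Activity Monitor: RSS and V8 heap, in MB.
function reportMemory(label) {
  var m = process.memoryUsage();
  console.log(label + ': rss ' + Math.round(m.rss / 1048576) + ' MB, heapUsed ' +
              Math.round(m.heapUsed / 1048576) + ' MB');
}
reportMemory('after JSON.parse of the whole file');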
But if, in a new node process, I load the file's contents into a string, chop it up at element boundaries, and JSON.parse each element individually, ostensibly getting the same object array:
var fs = require('fs');
// drop the leading "[" and trailing "}\n]", split at the "\n }," element boundaries,
// then re-attach each closing brace and parse
var arr2 = fs.readFileSync('JMdict-all.json', {encoding : 'utf8'}).trim().slice(1,-3).split('\n },').map(function(s) {return JSON.parse(s+'}');});
node uses just ~200 MB of memory, with no noticeable system lag. This pattern persists across many restarts of node: JSON.parse-ing the whole array takes a gigabyte of memory, while parsing it element-wise is much more memory-efficient.
Why is there such a huge disparity in memory usage? Is this a problem with JSON.parse preventing efficient hidden class generation in V8? How can I get good memory performance without slicing-and-dicing strings? Must I use a streaming JSON parse 😭?
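(For reference, here's a rough sketch of what I mean by "streaming", using only core modules and leaning on the same one-space-indentation layout from JSON.stringify(arr, null, 1); a real streaming JSON parser would be more robust:)

var fs = require('fs');
var readline = require('readline');

// Hand-rolled "streaming" sketch: read the file line by line, accumulate one
// top-level element at a time, and parse it, so the whole 31 MB string never
// has to sit in memory next to the parsed array. It assumes the one-space-indent
// layout, where top-level elements close with a line of exactly " }," or " }".
var arr3 = [];
var buf = [];
var rl = readline.createInterface({input: fs.createReadStream('JMdict-all.json')});
rl.on('line', function (line) {
  if (line === '[' || line === ']') { return; }  // skip the enclosing array brackets
  buf.push(line);
  if (line === ' },' || line === ' }') {         // end of one top-level element
    var text = buf.join('\n');
    if (text.charAt(text.length - 1) === ',') { text = text.slice(0, -1); }
    arr3.push(JSON.parse(text));
    buf = [];
  }
});
rl.on('close', function () { console.log(arr3.length + ' elements parsed'); });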
For ease of experimentation, I've put the JSON file in question in a Gist; please feel free to clone it: git clone https://gist.github.com/909090f86ab5d9e12985.git. Or if you just want to look at a bit of the JSON file, GitHub will show a few thousand lines of it: gist.github.com/fasiha/909090f86ab5d9e12985/revisions

Edit: if I start node with --expose-gc, run the first code snippet (using up 1 GB of memory), and then call global.gc() about fifty times, node's memory usage slowly drops to ~100 MB. The implications of that are striking.
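Spelled out, that experiment looks roughly like this in the REPL:

$ node --expose-gc
> var fs = require('fs');
> var arr1 = JSON.parse(fs.readFileSync('JMdict-all.json', {encoding : 'utf8'}));
> global.gc();  // repeat roughly fifty times, watching memory fall from ~1 GB toward ~100 MB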