I have a huge object that serves as a map with 2.7 million keys. I'm trying to write this object to the file system in order to persist it and not have to recompute it every time I need it. In another step, I need to read the object back in. I need the entire object in memory, since it serves as a lookup map.
For writing, I convert the object to an array and stream it to the file system with the function below. The reason I convert it to an array first is that streaming an array seems to be significantly faster than streaming an object. Writing takes about a minute, which is fine. The output file is 4.8 GB.
The problem arises when I try to read the file back. For this, I create a read stream and parse the content.
However, I seem to be hitting some sort of memory limit. I tried several approaches for reading and parsing, and they all work until roughly 50% of the data has been read (at that point the Node process on my machine occupies 6 GB of memory, slightly below the limit I set). From then on, the reading time increases by roughly a factor of 10, probably because Node is close to the maximum allocated memory limit (6144 MB). It feels like I'm doing something wrong.
The main thing I don't understand is why writing is not a problem while reading is, even though during the write step the entire array is kept in memory as well. I'm using Node v8.11.3.
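To confirm that this really is heap pressure, I could sample Node's built-in process.memoryUsage() next to the progress counter; here is only an illustrative sketch (the helper name is made up, not part of my actual code):
// Illustrative only: sample V8 heap usage next to the entry counter to check whether
// the slowdown correlates with the heap approaching the configured limit
function logHeapUsage(entryCount: number): void {
  const { heapUsed, rss } = process.memoryUsage()
  const toMB = (bytes: number) => Math.round(bytes / 1024 / 1024)
  console.log(`entries: ${entryCount}, heapUsed: ${toMB(heapUsed)} MB, rss: ${toMB(rss)} MB`)
}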
So to summarize:
- I have a large object I need to persist to the file system as an array using streams
- Writing works fine
- Reading works until around 50% of the data is read, then reading time increases significantly
How can I read the file more performantly?
I tried several libraries, such as stream-to-array, read-json-stream, and JSONStream, and ran into the same behavior with each of them.
Example of an object to write:
{ 'id': ['some_other_id_1', 'some_other_id_2'] }
This then gets converted to an array before writing:
[{ 'id': ['some_other_id_1', 'some_other_id_2'] }]
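The conversion itself is nothing special; roughly, it looks like this (function and variable names here are only illustrative, not my exact code):
// Sketch: convert the big lookup object into an array of single-key objects,
// matching the shape shown above (names are illustrative)
function mapObjectToArray(mapObject: { [id: string]: string[] }): Array<{ [id: string]: string[] }> {
  return Object.keys(mapObject).map((id) => ({ [id]: mapObject[id] }))
}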
Function to write the array to the file system using streams:
import * as fs from 'fs'
import * as jsonStream from 'JSONStream'
import * as streamifyArray from 'stream-array'
// `transform` (used in the read function further down) — assuming it comes from the
// stream-transform package; the import was not in my original snippet
import { transform } from 'stream-transform'

async function writeFileAsStreamFromArray(pathToFile: string, fileContent: any[]): Promise<void> {
  return new Promise((resolve, reject) => {
    const fileWriterStream = fs.createWriteStream(pathToFile)
    const stringifierStream = jsonStream.stringify()
    const readStream = streamifyArray(fileContent)

    // Pipeline: in-memory array -> JSON stringifier -> file on disk
    readStream.pipe(stringifierStream)
    stringifierStream.pipe(fileWriterStream)

    fileWriterStream.on('finish', () => {
      // The stringifier has already ended by the time the file stream finishes
      console.log('writeFileAsStreamFromArray: File written.')
      resolve()
    })
    fileWriterStream.on('error', (err) => {
      console.log('err', err)
      reject(err)
    })
  })
}
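It gets called roughly like this (the file path and surrounding names are illustrative):
// Example invocation (file path and names are illustrative)
async function persistMap(hugeMapObject: { [id: string]: string[] }): Promise<void> {
  const arrayContent = mapObjectToArray(hugeMapObject) // conversion sketched above
  await writeFileAsStreamFromArray('./huge-map.json', arrayContent)
}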
Function to get the array back from the stream using jsonStream:
async function getArrayFromStreamUsingJsonStream(pathToFile: string): Promise<any[]> {
  return new Promise((resolve, reject) => {
    const readStream = fs.createReadStream(pathToFile)
    const parseStream = jsonStream.parse('*') // emits every element of the top-level array
    const array: any[] = []
    const start = Date.now()

    // Collect each parsed entry and log progress every 100,000 entries
    const transformer = transform((entry) => {
      array.push(entry)
      if ((array.length % 100000) === 0) {
        const elapsedSeconds = (Date.now() - start) / 1000
        console.log('array', array.length, elapsedSeconds)
      }
    })

    readStream.pipe(parseStream)
    parseStream.pipe(transformer)

    // Resolve when the parser has emitted everything, not merely when the raw file
    // has been read
    parseStream.on('end', () => {
      console.log('getArrayFromStreamUsingJsonStream: array created')
      resolve(array)
    })
    readStream.on('error', (error) => {
      reject(error)
    })
    parseStream.on('error', (error) => {
      reject(error)
    })
  })
}
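After reading, the array still has to be merged back into a single lookup object before it can serve as a map; roughly like this (names are again illustrative):
// Sketch: rebuild the lookup object from the array of single-key entries
async function loadMap(pathToFile: string): Promise<{ [id: string]: string[] }> {
  const entries = await getArrayFromStreamUsingJsonStream(pathToFile)
  const mapObject: { [id: string]: string[] } = {}
  for (const entry of entries) {
    const id = Object.keys(entry)[0]
    mapObject[id] = entry[id]
  }
  return mapObject
}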
Timing logs (I canceled the execution after 1,200,000 entries because it was taking forever):
array 100000 6.345
array 200000 12.863
array 300000 21.177
array 400000 29.638
array 500000 35.884
array 600000 42.079
array 700000 48.74
array 800000 65.662
array 900000 89.805
array 1000000 120.416
array 1100000 148.892
array 1200000 181.921
...
Expected result: reading should be far more performant than it currently is. Is that even possible, or am I missing something obvious?
Any help is much appreciated!