I have a huge object that serves as a map with 2.7 million keys. I'm trying to write this object to the file system in order to persist it and not have to recompute it every time I need it. In another step, I need to read the object back in. I need the entire object in memory, since it serves as a lookup map.
For writing, I convert the object to an array and stream it to the file system with the function below. The reason I convert it to an array first is that streaming an array seems to be significantly faster than streaming an object. Writing takes about a minute, which is fine. The output file is 4.8 GB.
The problem arises when I try to read the file back. For this, I create a read stream and parse the content.
However, I seem to be hitting some sort of memory limit. I tried several approaches for reading and parsing, and they all work until roughly 50% of the data has been read (at that point the Node process on my machine occupies 6 GB of memory, slightly below the limit I set). From then on, the reading time increases by roughly a factor of 10, probably because Node is close to the maximum allocated memory limit (6144 MB). It feels like I'm doing something wrong.
The main thing I don't understand is why writing is not a problem while reading is, even though during the write step the entire array is kept in memory as well. I'm using Node v8.11.3.
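To confirm that this really is heap pressure, I could sample Node's built-in process.memoryUsage() next to the progress counter; here is only an illustrative sketch (the helper name is made up, not part of my actual code):
// Illustrative only: sample V8 heap usage next to the entry counter to check whether
// the slowdown correlates with the heap approaching the configured limit
function logHeapUsage(entryCount: number): void {
  const { heapUsed, rss } = process.memoryUsage()
  const toMB = (bytes: number) => Math.round(bytes / 1024 / 1024)
  console.log(`entries: ${entryCount}, heapUsed: ${toMB(heapUsed)} MB, rss: ${toMB(rss)} MB`)
}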
So to summarize:
- I have a large object I need to persist to the file system as an array using streams
- Writing works fine
- Reading works until around 50% of the data is read, then reading time increases significantly
How can I read the file more performantly?
I tried several libraries, such as stream-to-array, read-json-stream, and JSONStream, and ran into the same behavior with each of them.
Example of an object to write:
{ 'id': ['some_other_id_1', 'some_other_id_2'] }
This then gets converted to an array before writing:
[{ 'id': ['some_other_id_1', 'some_other_id_2'] }]
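The conversion itself is nothing special; roughly, it looks like this (function and variable names here are only illustrative, not my exact code):
// Sketch: convert the big lookup object into an array of single-key objects,
// matching the shape shown above (names are illustrative)
function mapObjectToArray(mapObject: { [id: string]: string[] }): Array<{ [id: string]: string[] }> {
  return Object.keys(mapObject).map((id) => ({ [id]: mapObject[id] }))
}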
Function to write the array to the file system using streams:
import * as fs from 'fs'
import * as jsonStream from 'JSONStream'
import * as streamifyArray from 'stream-array'
// `transform` (used in the read function further down) — assuming it comes from the
// stream-transform package; the import was not in my original snippet
import { transform } from 'stream-transform'

async function writeFileAsStreamFromArray(pathToFile: string, fileContent: any[]): Promise<void> {
  return new Promise((resolve, reject) => {
    const fileWriterStream = fs.createWriteStream(pathToFile)
    const stringifierStream = jsonStream.stringify()
    const readStream = streamifyArray(fileContent)

    // Pipeline: in-memory array -> JSON stringifier -> file on disk
    readStream.pipe(stringifierStream)
    stringifierStream.pipe(fileWriterStream)

    fileWriterStream.on('finish', () => {
      // The stringifier has already ended by the time the file stream finishes
      console.log('writeFileAsStreamFromArray: File written.')
      resolve()
    })
    fileWriterStream.on('error', (err) => {
      console.log('err', err)
      reject(err)
    })
  })
}
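It gets called roughly like this (the file path and surrounding names are illustrative):
// Example invocation (file path and names are illustrative)
async function persistMap(hugeMapObject: { [id: string]: string[] }): Promise<void> {
  const arrayContent = mapObjectToArray(hugeMapObject) // conversion sketched above
  await writeFileAsStreamFromArray('./huge-map.json', arrayContent)
}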
Function to get the array back from the stream using jsonStream:
async function getArrayFromStreamUsingJsonStream(pathToFile: string): Promise<any[]> {
  return new Promise((resolve, reject) => {
    const readStream = fs.createReadStream(pathToFile)
    const parseStream = jsonStream.parse('*') // emits every element of the top-level array
    const array: any[] = []
    const start = Date.now()

    // Collect each parsed entry and log progress every 100,000 entries
    const transformer = transform((entry) => {
      array.push(entry)
      if ((array.length % 100000) === 0) {
        const elapsedSeconds = (Date.now() - start) / 1000
        console.log('array', array.length, elapsedSeconds)
      }
    })

    readStream.pipe(parseStream)
    parseStream.pipe(transformer)

    // Resolve when the parser has emitted everything, not merely when the raw file
    // has been read
    parseStream.on('end', () => {
      console.log('getArrayFromStreamUsingJsonStream: array created')
      resolve(array)
    })
    readStream.on('error', (error) => {
      reject(error)
    })
    parseStream.on('error', (error) => {
      reject(error)
    })
  })
}
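After reading, the array still has to be merged back into a single lookup object before it can serve as a map; roughly like this (names are again illustrative):
// Sketch: rebuild the lookup object from the array of single-key entries
async function loadMap(pathToFile: string): Promise<{ [id: string]: string[] }> {
  const entries = await getArrayFromStreamUsingJsonStream(pathToFile)
  const mapObject: { [id: string]: string[] } = {}
  for (const entry of entries) {
    const id = Object.keys(entry)[0]
    mapObject[id] = entry[id]
  }
  return mapObject
}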
Timing logs (I canceled the execution after 1,200,000 entries because it was taking forever):
array 100000 6.345
array 200000 12.863
array 300000 21.177
array 400000 29.638
array 500000 35.884
array 600000 42.079
array 700000 48.74
array 800000 65.662
array 900000 89.805
array 1000000 120.416
array 1100000 148.892
array 1200000 181.921
...
Expected result: reading should be far more performant than it currently is. Is that even possible, or am I missing something obvious?
Any help is much appreciated!