
I have to open a very large file (~15GB) and was trying to read the whole file using fs.readFileSync, then put the whole file into a hashmap keyed on a UUID to dedup the records. But I soon hit the issue that I can't read the whole file into memory because of the V8 limit!

I tried to pass a larger memory size using --max-old-space-size, but it's still not working.

Why is that?

Is this a limitation of Node.js, or am I missing something?

I have 64GB of RAM in my machine.

For example, there is a large file data.txt with the following format, and I have to dedup based on the uuid:

new record
field_separator
1fd265da-e5a6-11ea-adc1-0242ac120002 <----uuid
field_separator
Bob
field_separator
32
field_separator
Software Engineer
field_separator
Workday
point_separator
new record
field_separator
5396553e-e5a6-11ea-adc1-0242ac120002
field_separator
Tom
field_separator
27
field_separator
QA Engineer
field_separator
Synopsis
point_separator
........

There is another small file (~200 MB) which contains UUIDs with different values. I have to do lookups against it using the UUIDs from the above-mentioned file.

The script is just for one-time processing.

  • How much memory do you have in your system? What exact output are you trying to achieve? We can only really help you with alternative methods if we can see the actual data and the actual operation you're trying to achieve. It is unlikely that the best way to achieve this is reading the entire file into memory at once. Commented Aug 24, 2020 at 0:52
  • @jfriend00 I have 64GB of RAM in my system. So if there is a way to put the complete file into memory with Node, memory should not be the issue. Commented Aug 24, 2020 at 0:58
  • What type of file is this? If you can't stream it, can you use a memory-mapped file? Commented Aug 24, 2020 at 1:04
  • It's a text file. Commented Aug 24, 2020 at 1:06
  • What kind of text file? Commented Aug 24, 2020 at 1:06

2 Answers


Node documentation states the maximum buffer size is ~1GB on 32-bit systems and ~2GB on 64-bit systems.
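You can also check the limits compiled into your own Node build directly; the buffer module exposes them as constants:

// print the hard limits of this Node build
const buffer = require("buffer");
console.log(buffer.constants.MAX_LENGTH);        // largest allowed Buffer/TypedArray, in bytes
console.log(buffer.constants.MAX_STRING_LENGTH); // longest allowed string, in UTF-16 code units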

You can also search Stack Overflow for questions about the maximum size of objects or heap memory used by V8, the JavaScript engine used in Node.js.

I suspect the chance of reading a 15GB file into memory and creating objects based on its entire content is about zero, and that you will need to look at alternatives to fs.readFileSync (such as reading a stream, using a database, or using a different server).

It may be worth verifying that the "available" memory values in heap statistics reflect the size set with the CLI option --max-old-space-size. Heap statistics can be generated by running

const v8 = require("v8");
console.log(v8.getHeapSpaceStatistics()); // sizes and capacities of each heap space
console.log(v8.getHeapStatistics());      // includes heap_size_limit, the overall ceiling

in Node.
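
For example, assuming the snippet is saved as heap-check.js (a filename chosen here just for illustration), run it with and without the flag and compare heap_size_limit. The flag's value is in megabytes:

node --max-old-space-size=32768 heap-check.js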

A question answered in 2017 asked about increasing the fixed limit on string size. The limit may have been raised since then, but comment 9 in the (closed) issue 6148 said it was unlikely to ever exceed the limit of 32-bit addressing (4GB).

Without changes to the buffer and string size limits, fs.readFileSync cannot read and return the contents of a 15GB file as a string or buffer.
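
For illustration, here is a minimal sketch of the streaming alternative: the data arrives in manageable chunks rather than as one 15GB buffer. The file name is taken from the question's example, and the chunk handler is left as a stub:

const fs = require("fs");

async function scan(path) {
  let bytes = 0;
  // a readable stream is async-iterable; each chunk is a Buffer of modest size
  for await (const chunk of fs.createReadStream(path)) {
    bytes += chunk.length; // parse/dedup each chunk here instead of just counting
  }
  console.log(`scanned ${bytes} bytes without holding the file in memory`);
}

scan("data.txt").catch(console.error);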


3 Comments

So what does the --max-old-space-size flag do then? I thought I could pass a larger memory limit with this flag.
@Exploring An issue of interest; please see the updated reply.
@Exploring As they say, trust but verify. Writing test code to find the maximum buffer/typed-array size that can be used is a good idea; documentation on the web may not always be applicable or up to date. Even if making the file processing asynchronous may be a better option, you should be able to read the file synchronously, one buffer at a time, by opening it and positioning each read with fs.readvSync(fd, buffers[, position]). In theory at least; it's a long time since I've used file descriptors.
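
A rough sketch of what that comment describes, using the single-buffer fs.readSync for simplicity (the 64MB chunk size is an arbitrary choice, and real record parsing would have to carry partial records across chunk boundaries):

const fs = require("fs");

const CHUNK_SIZE = 64 * 1024 * 1024; // 64MB per read
const fd = fs.openSync("data.txt", "r");
const buffer = Buffer.alloc(CHUNK_SIZE);
let position = 0;

let bytesRead;
// read the file a buffer at a time, advancing an explicit position
while ((bytesRead = fs.readSync(fd, buffer, 0, CHUNK_SIZE, position)) > 0) {
  position += bytesRead;
  // process buffer.subarray(0, bytesRead) here
}
fs.closeSync(fd);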

If what you're trying to do is this:

Append records to the smaller file whose UUID is unique (not already present in the smaller file)

Then, I would suggest the following process (a code sketch follows the list).

  1. Design a scheme for reading the next record from a file and parsing the data into a JavaScript object.
  2. Use that scheme to read through all the records in the smaller file (one record at a time), adding each UUID in that file to a Set object (for keeping track of uniqueness).
  3. After you're done with the small file, you now have a Set object containing all the already-known UUIDs.
  4. Now, use that same reading scheme to read each record (one record at a time) from the larger file. If the record's UUID is not in the Set, add it to the Set and append the record to the smaller file. If the UUID is already in the Set, skip the record.
  5. Continue reading records from the large file until you've checked them all.
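
Here is a sketch of that process, assuming the record layout from the question (the UUID is the first field after the "new record" marker) and hypothetical file names:

const fs = require("fs");
const readline = require("readline");

// step 1: read one record at a time, yielding { uuid, fields }
async function* readRecords(path) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity,
  });
  let fields = [];
  for await (const line of rl) {
    if (line === "point_separator") {
      // fields looks like ["new record", uuid, name, age, title, company]
      yield { uuid: fields[1], fields };
      fields = [];
    } else if (line !== "field_separator") {
      fields.push(line);
    }
  }
}

async function dedup(smallPath, largePath) {
  const seen = new Set();

  // steps 2-3: collect the UUIDs already present in the small file
  for await (const { uuid } of readRecords(smallPath)) {
    seen.add(uuid);
  }

  // steps 4-5: append records with unseen UUIDs from the large file
  const out = fs.createWriteStream(smallPath, { flags: "a" });
  for await (const { uuid, fields } of readRecords(largePath)) {
    if (!seen.has(uuid)) {
      seen.add(uuid);
      out.write(fields.join("\nfield_separator\n") + "\npoint_separator\n");
    }
  }
  out.end();
}

dedup("small.txt", "data.txt").catch(console.error);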

