
I'm attempting to write a bit of JS that will read a file and write it out to a stream. The deal is that the file is extremely large, and so I have to read it bit by bit. It seems that I shouldn't be running out of memory, but I do. Here's the code:

var size = fs.statSync("tmpfile.tmp").size;

var fp = fs.openSync("tmpfile.tmp", "r");

for(var pos = 0; pos < size; pos += 50000){
    var buf = new Buffer(50000),
        len = fs.readSync(fp, buf, 0, 50000, (function(){
            console.log(pos);
            return pos;
        })());

    data_output.write(buf.toString("utf8", 0, len));

    delete buf;
}

data_output.end();

For some reason it hits 264900000 and then throws FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory. I'd figure that the data_output.write() call would force it to write the data out to data_output, and then discard it from memory, but I could be wrong. Something is causing the data to stay in memory, and I've no idea what it would be. Any help would be greatly appreciated.

1 Comment

delete buf; is invalid, try buf = null. (Oct 25, 2011)

4 Answers

3

I had a very similar problem. I was reading in a very large csv file with 10M lines, and writing out its json equivalent. I saw in the windows task manager that my process was using > 2GB of memory. Eventually I figured out that the output stream was probably slower than the input stream, and that the outstream was buffering a huge amount of data. I was able to fix this by pausing the instream every 100 writes to the outstream, and waiting for the outstream to empty. This gives time for the outstream to catch up with the instream. I don't think it matters for the sake of this discussion, but I was using 'readline' to process the csv file one line at a time.

I also figured out along the way that if, instead of writing every line to the outstream, I concatenate 100 or so lines together, then write them together, this also improved the memory situation and made for faster operation.

In the end, I found that I could do the file transfer (csv -> json) using just 70M of memory.

Here's a code snippet for my write function:

var write_counter = 0;
var out_string = "";
function myWrite(inStream, outStream, string, finalWrite) {
    out_string += string;
    write_counter++;
    if ((write_counter === 100) || (finalWrite)) {
        // pause the instream until the outstream clears
        inStream.pause();
        outStream.write(out_string, function () {
            inStream.resume();
        });
        write_counter = 0;
        out_string = "";
    }
}
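For illustration, here is a rough sketch of how a function like this might be driven from 'readline'; the file names and the convertToJson() helper below are made up for the example:

var fs = require('fs');
var readline = require('readline');

var inStream = fs.createReadStream('input.csv');      // placeholder input file
var outStream = fs.createWriteStream('output.json');  // placeholder output file
var rl = readline.createInterface({ input: inStream });

rl.on('line', function (line) {
    // convertToJson() stands in for whatever per-line csv -> json conversion you do
    myWrite(inStream, outStream, convertToJson(line) + '\n', false);
});

rl.on('close', function () {
    // force out whatever is still sitting in out_string
    myWrite(inStream, outStream, '', true);
    outStream.end();
});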

2

You should be using pipes, such as:

var fp = fs.createReadStream("tmpfile.tmp");
fp.pipe(data_output);

For more information, check out: http://nodejs.org/docs/v0.5.10/api/streams.html#stream.pipe

EDIT: the problem in your implementation, by the way, is that by reading it in chunks in a tight synchronous loop like that, the write buffer never gets a chance to flush, so you end up reading the entire file into memory before much of it is written back out.
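For example, since the asker mentions in a comment further down that data_output is actually an HTTP request to remote storage, a rough sketch of piping the file straight into that request could look like this (the method, host, path and headers are placeholders):

var fs = require('fs');
var http = require('http');

var stat = fs.statSync('tmpfile.tmp');

var req = http.request({
    method: 'PUT',                     // placeholder; use whatever the storage API expects
    host: 'storage.example.com',       // placeholder host
    path: '/upload/tmpfile.tmp',       // placeholder path
    headers: { 'Content-Length': stat.size }
}, function (res) {
    console.log('upload finished with status ' + res.statusCode);
});

// pipe() reads the file in small chunks and respects back-pressure from the
// socket, so memory stays bounded regardless of the file size
fs.createReadStream('tmpfile.tmp').pipe(req);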


1

According to the documentation, data_output.write(...) will return true if the string has been flushed, and false if it has not (due to the kernel buffer being full). What kind of stream is this?

Also, I'm (fairly) sure this isn't the problem, but: how come you allocate a new Buffer on each loop iteration? Wouldn't it make more sense to initialize buf before the loop?

2 Comments

Ah, good call. That was me just debugging, trying to figure out whether deleting it after each iteration would do anything. This is actually sending a large file to remote storage, and it is HTTP.
Re: HTTP: that makes sense. You can read the file much faster than you can send it over the network, and write does not block until the bytes are actually sent. (It'll just return false if they're not sent yet, and then later issue a drain event once they are.)
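To make that concrete, here is a minimal sketch of the original loop restructured to respect write()'s return value and the 'drain' event, reusing a single buffer as suggested above (data_output is the same stream from the question):

var fs = require('fs');

var size = fs.statSync('tmpfile.tmp').size;
var fp = fs.openSync('tmpfile.tmp', 'r');
var buf = new Buffer(50000);   // allocated once and reused
var pos = 0;

function writeNextChunk() {
    while (pos < size) {
        var len = fs.readSync(fp, buf, 0, 50000, pos);
        pos += len;
        // write() returns false when its buffer is full; stop reading
        // and wait for 'drain' before continuing
        if (!data_output.write(buf.toString('utf8', 0, len))) {
            data_output.once('drain', writeNextChunk);
            return;
        }
    }
    data_output.end();
    fs.closeSync(fp);
}
writeNextChunk();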
0

I don't know how the synchronous file functions are implemented, but have you considered using the async ones? That would be more likely to allow garbage collection and I/O flushing to happen. So instead of a for loop, you would trigger the next read in the callback function of the previous read.

Something along these lines (note also that, per other comments, I'm reusing the Buffer):

var buf = new Buffer(50000);
var pos = 0;

function readNextChunk () {
    fs.read(fp, buf, 0, 50000, pos,
      function(err, bytesRead){
        if (err) {
          // handle error
        }
        else {
          data_output.write(buf.toString("utf8", 0, bytesRead));
          pos += bytesRead;
          if (pos < size)
            readNextChunk();
          else
            data_output.end();
        }
      });
}
readNextChunk();
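The snippet above assumes fp and size already exist (they come from the question's code). A rough sketch of obtaining them asynchronously as well, in which case the bare readNextChunk() call at the bottom would move into the open callback:

var fs = require('fs');
var size, fp;

fs.stat('tmpfile.tmp', function (err, stats) {
    if (err) throw err;
    size = stats.size;
    fs.open('tmpfile.tmp', 'r', function (err, fd) {
        if (err) throw err;
        fp = fd;
        readNextChunk();   // kick off the chained reads defined above
    });
});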

