
I'm attempting to write a bit of JS that will read a file and write it out to a stream. The deal is that the file is extremely large, and so I have to read it bit by bit. It seems that I shouldn't be running out of memory, but I do. Here's the code:

var size = fs.statSync("tmpfile.tmp").size;

var fp = fs.openSync("tmpfile.tmp", "r");

for(var pos = 0; pos < size; pos += 50000){
    var buf = new Buffer(50000),
        len = fs.readSync(fp, buf, 0, 50000, (function(){
            console.log(pos);
            return pos;
        })());

    data_output.write(buf.toString("utf8", 0, len));

    delete buf;
}

data_output.end();

For some reason it hits 264900000 and then throws FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory. I'd figure that the data_output.write() call would force it to write the data out to data_output, and then discard it from memory, but I could be wrong. Something is causing the data to stay in memory, and I've no idea what it would be. Any help would be greatly appreciated.

1 Comment

delete buf; is invalid, try buf = null. (Oct 25, 2011)

4 Answers

3

I had a very similar problem. I was reading in a very large csv file with 10M lines, and writing out its json equivalent. I saw in the windows task manager that my process was using > 2GB of memory. Eventually I figured out that the output stream was probably slower than the input stream, and that the outstream was buffering a huge amount of data. I was able to fix this by pausing the instream every 100 writes to the outstream, and waiting for the outstream to empty. This gives time for the outstream to catch up with the instream. I don't think it matters for the sake of this discussion, but I was using 'readline' to process the csv file one line at a time.

I also figured out along the way that if, instead of writing every line to the outstream, I concatenate 100 or so lines together, then write them together, this also improved the memory situation and made for faster operation.

In the end, I found that I could do the file transfer (csv -> json) using just 70M of memory.

Here's a code snippet for my write function:

var write_counter = 0;
var out_string = "";
function myWrite(inStream, outStream, string, finalWrite) {
    out_string += string;
    write_counter++;
    if ((write_counter === 100) || (finalWrite)) {
        // pause the instream until the outstream clears
        inStream.pause();
        outStream.write(out_string, function () {
            inStream.resume();
        });
        write_counter = 0;
        out_string = "";
    }
}
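For illustration, here is a rough sketch of how a function like this might be driven from 'readline'; the file names and the convertToJson() helper below are made up for the example:

var fs = require('fs');
var readline = require('readline');

var inStream = fs.createReadStream('input.csv');      // placeholder input file
var outStream = fs.createWriteStream('output.json');  // placeholder output file
var rl = readline.createInterface({ input: inStream });

rl.on('line', function (line) {
    // convertToJson() stands in for whatever per-line csv -> json conversion you do
    myWrite(inStream, outStream, convertToJson(line) + '\n', false);
});

rl.on('close', function () {
    // force out whatever is still sitting in out_string
    myWrite(inStream, outStream, '', true);
    outStream.end();
});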

2

You should be using pipes, such as:

var fp = fs.createReadStream("tmpfile.tmp");
fp.pipe(data_output);

For more information, check out: http://nodejs.org/docs/v0.5.10/api/streams.html#stream.pipe

EDIT: the problem in your implementation, by the way, is that by reading it in chunks in a tight synchronous loop like that, the write buffer never gets a chance to flush, so you end up reading the entire file into memory before much of it is written back out.
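For example, since the asker mentions in a comment further down that data_output is actually an HTTP request to remote storage, a rough sketch of piping the file straight into that request could look like this (the method, host, path and headers are placeholders):

var fs = require('fs');
var http = require('http');

var stat = fs.statSync('tmpfile.tmp');

var req = http.request({
    method: 'PUT',                     // placeholder; use whatever the storage API expects
    host: 'storage.example.com',       // placeholder host
    path: '/upload/tmpfile.tmp',       // placeholder path
    headers: { 'Content-Length': stat.size }
}, function (res) {
    console.log('upload finished with status ' + res.statusCode);
});

// pipe() reads the file in small chunks and respects back-pressure from the
// socket, so memory stays bounded regardless of the file size
fs.createReadStream('tmpfile.tmp').pipe(req);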


1

According to the documentation, data_output.write(...) will return true if the string has been flushed, and false if it has not (due to the kernel buffer being full). What kind of stream is this?

Also, I'm (fairly) sure this isn't the problem, but: how come you allocate a new Buffer on each loop iteration? Wouldn't it make more sense to initialize buf before the loop?

2 Comments

Ah, good call. That was me just debugging, trying to figure out whether deleting it after each iteration would do anything. This is actually sending a large file to remote storage, and it is HTTP.
Re: HTTP: that makes sense. You can read the file much faster than you can send it over the network, and write does not block until the bytes are actually sent. (It'll just return false if they're not sent yet, and then later issue a drain event once they are.)
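To make that concrete, here is a minimal sketch of the original loop restructured to respect write()'s return value and the 'drain' event, reusing a single buffer as suggested above (data_output is the same stream from the question):

var fs = require('fs');

var size = fs.statSync('tmpfile.tmp').size;
var fp = fs.openSync('tmpfile.tmp', 'r');
var buf = new Buffer(50000);   // allocated once and reused
var pos = 0;

function writeNextChunk() {
    while (pos < size) {
        var len = fs.readSync(fp, buf, 0, 50000, pos);
        pos += len;
        // write() returns false when its buffer is full; stop reading
        // and wait for 'drain' before continuing
        if (!data_output.write(buf.toString('utf8', 0, len))) {
            data_output.once('drain', writeNextChunk);
            return;
        }
    }
    data_output.end();
    fs.closeSync(fp);
}
writeNextChunk();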
0

I don't know how the synchronous file functions are implemented, but have you considered using the async ones? That would be more likely to allow garbage collection and I/O flushing to happen. So instead of a for loop, you would trigger the next read in the callback function of the previous read.

Something along these lines (note also that, per other comments, I'm reusing the Buffer):

var buf = new Buffer(50000);
var pos = 0;

function readNextChunk () {
    fs.read(fp, buf, 0, 50000, pos,
      function(err, bytesRead){
        if (err) {
          // handle error
        }
        else {
          data_output.write(buf.toString("utf8", 0, bytesRead));
          pos += bytesRead;
          if (pos < size)
            readNextChunk();
          else
            data_output.end();
        }
      });
}
readNextChunk();
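The snippet above assumes fp and size already exist (they come from the question's code). A rough sketch of obtaining them asynchronously as well, in which case the bare readNextChunk() call at the bottom would move into the open callback:

var fs = require('fs');
var size, fp;

fs.stat('tmpfile.tmp', function (err, stats) {
    if (err) throw err;
    size = stats.size;
    fs.open('tmpfile.tmp', 'r', function (err, fd) {
        if (err) throw err;
        fp = fd;
        readNextChunk();   // kick off the chained reads defined above
    });
});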

