.net: efficient way to read a binary file into memory then access

Question

I'm a novice programmer. I'm creating a library to process binary files of a certain type -- like a codec (though without a need to process a progressive stream coming over a wire). I'm looking for an efficient way to read the file into memory and then parse portions of the data as needed. In particular, I'd like to avoid large memory copies, hopefully without a lot of added complexity to avoid that.

In some situations, I want to do sequential reading of values in the data. For this, a MemoryStream works well.

    FileStream fs = new FileStream(_fileName, FileMode.Open, FileAccess.Read);
    byte[] bytes = new byte[fs.Length];
    fs.Read(bytes, 0, bytes.Length);
    _ms = new MemoryStream(bytes, 0, bytes.Length, false, true);
    fs.Close();

(That involved a copy from the bytes array into the memory stream; that's one time, and I don't know of a way to avoid it.)

With the memory stream, it's easy to seek to arbitrary positions and then start reading structure members. E.g.,

    _ms.Seek(_tableRecord.Offset, SeekOrigin.Begin);
    byte[] ab32 = new byte[4];

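    // Read fills ab32 and returns the byte count, so convert the buffer, not the return value
    _ms.Read(ab32, 0, ab32.Length);
    _version = ConvertToUint(ab32);
    _ms.Read(ab32, 0, ab32.Length);
    _numRecords = ConvertToUint(ab32);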
    // etc.

But there may also be times when I want to take a slice out of the memory corresponding to some large structure and then pass into a method for certain processing. MemoryStream doesn't support that. I could always pass the MemoryStream plus offset and length, though that might not always be the most convenient.

Instead of MemoryStream, I could store the data in memory using Memory<byte>. That supports slicing, but not sequential reading.

If for some situation I want to get a slice (rather than pass stream & offset/length), I could construct an ArraySegment from MemoryStream.GetBuffer.

    // "as" is a reserved C# keyword, so the variable needs a different name
    ArraySegment<byte> segment = new ArraySegment<byte>(ms.GetBuffer(), offset, length);

It's not clear to me, though, if that will result in a (potentially large) copy, or if that uses a reference into the same memory held by the MemoryStream. I gather that GetBuffer exposes the underlying memory rather than providing a copy; and that ArraySegment will point into the same memory?

There will be times when I need to get a slice that is a copy as I'll need to modify some elements and then process that, but without changing the original. If ArraySegment gets a reference rather than a copy, I gather I could use ArraySegment<byte>.ToArray()?

So, my questions are: Is MemoryStream the best approach? Is there any other type that allows sequential reading like MemoryStream but also allows slicing like Memory<byte>?

If I want a slice without copying memory, will ArraySegment<byte>(ms.GetBuffer(), offset, length) do that?

Then if I need a copy that can be modified without affecting the original, use ArraySegment<byte>.ToArray()?

Is there a way to read the data from a file directly into a new MemoryStream without creating a temporary byte array that gets copied?

Am I approaching all this the best way?

Since you can access the buffer that MemoryStream uses, you can take a slice out of it just to read, or copy a fragment and modify it. In general, you have answered your own questions well. As for the last one, if the explicit buffer bothers you, you can use CopyTo between streams; or, in your case, you can call File.ReadAllBytes and skip the FileStream as well.
– greenoldman, Commented May 10, 2020 at 5:01
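
The two behaviours discussed in the question (an ArraySegment over GetBuffer() shares memory, while ToArray() copies) can be checked with a small sketch. The class and method names here are hypothetical, and the sample bytes are made up for illustration. `using System.Linq` is included only so that `ToArray()` also resolves on older frameworks where ArraySegment<T> has no instance method of that name.

```csharp
using System;
using System.IO;
using System.Linq;

public static class SegmentDemo
{
    // Returns true if an ArraySegment built over GetBuffer() observes a
    // later change to the underlying array (i.e. no copy was made).
    public static bool SegmentSharesBuffer()
    {
        byte[] bytes = { 10, 20, 30, 40, 50 };
        // publiclyVisible: true is required, otherwise GetBuffer() throws.
        var ms = new MemoryStream(bytes, 0, bytes.Length, writable: false, publiclyVisible: true);

        var segment = new ArraySegment<byte>(ms.GetBuffer(), 1, 3);
        bytes[2] = 99; // change the original array after the segment exists
        return segment.Array[segment.Offset + 1] == 99;
    }

    // Returns true if ToArray() produced a copy that is independent
    // of the original array.
    public static bool ToArrayIsIndependent()
    {
        byte[] bytes = { 10, 20, 30, 40, 50 };
        var segment = new ArraySegment<byte>(bytes, 1, 3);
        byte[] copy = segment.ToArray(); // copies the 3 bytes of the segment
        copy[0] = 77;                    // modify the copy only
        return bytes[1] == 20 && copy.Length == 3;
    }
}
```

Both methods should return true if the segment is a view and the array from ToArray() is a separate allocation.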

Peter Constable · Accepted Answer · 2020-05-11 08:16:59Z

To get the initial MemoryStream from reading the file, the following works:

    byte[] bytes;
    try
    {
        // File.ReadAllBytes opens a filestream and then ensures it is closed
        bytes = File.ReadAllBytes(_fi.FullName); 
        _ms = new MemoryStream(bytes, 0, bytes.Length, false, true);
    }
    catch (IOException e)
    {
        throw; // rethrow without resetting the stack trace ("throw e;" would reset it)
    }

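File.ReadAllBytes() reads the entire file content into a new byte array. Internally it wraps the file access in a using statement, which ensures the file is closed even if an exception is thrown, so no finally block is needed. The final constructor argument (publiclyVisible: true) is what allows GetBuffer() to be called on the MemoryStream later.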

I can read individual values from the MemoryStream using MemoryStream.Read. These calls involve copies of those values, which is fine.
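
As a rough illustration of such a read, here is a sketch of a helper that pulls one 32-bit value from a stream. The helper name is hypothetical, and big-endian byte order is an assumption; the original post does not say which byte order the file format uses. The loop is there because Stream.Read returns the number of bytes actually read, which in general may be fewer than requested.

```csharp
using System;
using System.IO;

public static class ValueReader
{
    // Reads 4 bytes from the current position and combines them as a
    // big-endian unsigned integer (byte order is an assumption).
    public static uint ReadUInt32BigEndian(Stream s)
    {
        byte[] buf = new byte[4];
        int read = 0;
        while (read < buf.Length)
        {
            int n = s.Read(buf, read, buf.Length - read);
            if (n == 0)
                throw new EndOfStreamException("Fewer than 4 bytes remained.");
            read += n;
        }
        return ((uint)buf[0] << 24) | ((uint)buf[1] << 16) | ((uint)buf[2] << 8) | buf[3];
    }
}
```

Only the four bytes of each value are copied, so the cost per read is small regardless of the file size.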

In one situation, I needed to read a table out of the file, change a value, and then calculate a checksum of the entire file with that change in place. Instead of copying the entire file so that I could edit one part, I was able to calculate the checksum in progressive steps: first on the initial, unchanged segment of the file, then continue with the middle segment that was changed, then continue with the remainder.

For this I could process the first and final segments using the MemoryStream. This involved lots of reads, with each read copying; but those copies were transient variables, so no significant working set increase.

The middle segment had to be copied, because it needed to be changed while the original version was kept intact. The following worked:

    // get ref (not copy!) to the byte array underlying the MemoryStream
    byte[] fileData = _ms.GetBuffer();

    // determine the required length
    int length = _tableRecord.Length;

    // create array to hold the copy
    byte[] segmentCopy = new byte[length];

    // get the copy
    Array.ConstrainedCopy(fileData, _tableRecord.Offset, segmentCopy, 0, length);

After modifying values in segmentCopy, I then needed to pass this to my static method for calculating checksums, which expected a MemoryStream (for sequential reading). This worked:

    // new MemoryStream will hold a ref to the segmentCopy array (no new copy!)
    MemoryStream ms = new MemoryStream(segmentCopy, 0, segmentCopy.Length);
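
The three-step checksum described above might look roughly like the sketch below. The actual checksum algorithm is not given in the post, so a simple running sum of bytes stands in for it; `ChecksumSketch`, `Accumulate`, and `ChecksumWithPatch` are hypothetical names. Only the patched middle segment is a separate array; the head and tail streams wrap the original file data without copying it.

```csharp
using System;
using System.IO;

public static class ChecksumSketch
{
    // Placeholder checksum: continues a running 32-bit sum of bytes from 'seed'.
    // (Assumption: the real format's checksum can likewise be continued
    // from one segment to the next.)
    public static uint Accumulate(uint seed, Stream s)
    {
        uint sum = seed;
        int b;
        while ((b = s.ReadByte()) != -1)
            sum = unchecked(sum + (uint)b);
        return sum;
    }

    // Checksum of fileData as if the bytes starting at 'offset' were
    // replaced by 'patched', without modifying or copying fileData.
    public static uint ChecksumWithPatch(byte[] fileData, int offset, byte[] patched)
    {
        int tailStart = offset + patched.Length;
        uint sum = 0;

        using (var head = new MemoryStream(fileData, 0, offset, writable: false))
            sum = Accumulate(sum, head);
        using (var middle = new MemoryStream(patched, writable: false))
            sum = Accumulate(sum, middle);
        using (var tail = new MemoryStream(fileData, tailStart, fileData.Length - tailStart, writable: false))
            sum = Accumulate(sum, tail);

        return sum;
    }
}
```

For example, with file bytes 1, 2, 3, 4, 5 and a two-byte patch of 10, 10 at offset 1, the summed bytes are 1, 10, 10, 4, 5, and the original array is left unchanged.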

What I haven't needed to do yet, but will want to do, is to get a slice of the MemoryStream that doesn't involve copying. This works:

    MemoryStream sliceFromMS = new MemoryStream(fileData, offset, length);

From above, fileData was a ref to the array underlying the original MemoryStream. Now sliceFromMS will have a ref to a segment within that same array.
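
A small sketch can confirm that such a slice is a view rather than a copy. The names below are hypothetical and the sample bytes are made up. One detail worth knowing: the three-argument constructor creates a writable stream, so writes through the slice would change the original data; the sketch passes writable: false to guard against that.

```csharp
using System;
using System.IO;

public static class SliceDemo
{
    // Creates a read-only stream over a region of an existing array.
    // No bytes are copied; the stream keeps a reference to fileData.
    public static MemoryStream ReadOnlySlice(byte[] fileData, int offset, int length)
    {
        return new MemoryStream(fileData, offset, length, writable: false);
    }

    // Returns true if a change made to the array after the slice was
    // created is visible when reading through the slice.
    public static bool SliceSeesLaterChange()
    {
        byte[] fileData = { 1, 2, 3, 4, 5, 6 };
        MemoryStream slice = ReadOnlySlice(fileData, 2, 3);
        fileData[2] = 42; // first byte of the slice
        return slice.Length == 3 && slice.ReadByte() == 42;
    }
}
```

Position and Length on the slice are relative to the slice, not to the underlying array, so sequential reading code does not need to know the original offset.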

Don't use File.ReadAllBytes; that's just waiting to explode in your face for various reasons (memory consumption is one). Holding files in memory via byte[] is a bad idea most of the time as well. Use Stream.CopyTo[Async] with MemoryStream.
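
For comparison, the CopyTo approach suggested in this comment could be sketched as follows; `LoadSketch`, `LoadIntoMemory`, and `ValidBytes` are hypothetical names. Note that this still holds the whole file in memory, and because a default MemoryStream grows its internal buffer as needed, GetBuffer() may return an array longer than the data. TryGetBuffer reports the valid region as an ArraySegment.

```csharp
using System;
using System.IO;

public static class LoadSketch
{
    // Copies everything from 'source' into a new MemoryStream and rewinds it.
    public static MemoryStream LoadIntoMemory(Stream source)
    {
        var ms = new MemoryStream();
        source.CopyTo(ms);
        ms.Position = 0; // rewind so callers can read from the start
        return ms;
    }

    // Returns the valid portion of the stream's internal buffer without copying.
    public static ArraySegment<byte> ValidBytes(MemoryStream ms)
    {
        if (!ms.TryGetBuffer(out ArraySegment<byte> segment))
            throw new InvalidOperationException("The stream's buffer is not exposable.");
        return segment;
    }
}
```

A caller would typically open the file with a FileStream inside a using statement, pass it to LoadIntoMemory, and let the FileStream close once the copy is done.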

Vadim Baratashvili · Accepted Answer · 2020-05-10 00:43:25Z


As I understand it, you can use FileStream.Seek directly; there is no need to load the data into memory first and then use the corresponding MemoryStream method.

In the following example, str1 and str2 are equal:

    using (var fs = new FileStream(@"C:\Users\bar_v\OneDrive\Desktop\js_balancer.txt", FileMode.Open))
    {
        var buffer = new byte[20];
        fs.Read(buffer, 0, 20);
        var str1 = Encoding.ASCII.GetString(buffer);
        fs.Seek(0, SeekOrigin.Begin);
        fs.Read(buffer, 0, 20);
        var str2 = Encoding.ASCII.GetString(buffer);
    }

By the way, when you create a new MemoryStream object, you don’t copy the byte array, you just keep a reference to it:

    public MemoryStream(byte[] buffer, bool writable)
    {
        if (buffer == null)
            throw new ArgumentNullException(nameof(buffer), SR.ArgumentNull_Buffer);

        _buffer = buffer;
        _length = _capacity = buffer.Length;
        _writable = writable;
        _exposable = false;
        _origin = 0;
        _isOpen = true;
    }

But when reading, as we can see, copying occurs:

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (buffer == null)
            throw new ArgumentNullException(nameof(buffer), SR.ArgumentNull_Buffer);
        if (offset < 0)
            throw new ArgumentOutOfRangeException(nameof(offset), SR.ArgumentOutOfRange_NeedNonNegNum);
        if (count < 0)
            throw new ArgumentOutOfRangeException(nameof(count), SR.ArgumentOutOfRange_NeedNonNegNum);
        if (buffer.Length - offset < count)
            throw new ArgumentException(SR.Argument_InvalidOffLen);

        EnsureNotClosed();

        int n = _length - _position;
        if (n > count)
            n = count;
        if (n <= 0)
            return 0;

        Debug.Assert(_position + n >= 0, "_position + n >= 0");  // len is less than 2^31 -1.

        if (n <= 8)
        {
            int byteCount = n;
            while (--byteCount >= 0)
                buffer[offset + byteCount] = _buffer[_position + byteCount];
        }
        else
            Buffer.BlockCopy(_buffer, _position, buffer, offset, n);
        _position += n;

        return n;
    }

edited May 10, 2020 at 0:43

answered May 9, 2020 at 21:08

Vadim Baratashvili


5 Comments

Peter Constable Over a year ago

I was using MemoryStream instead of FileStream so the data could be held without keeping the file open. One use may be to explore or edit the content, and it could be used in an app that way for an indefinite amount of time. I didn't think it would be a good idea to hold a file open like that. Am I wrong?

Peter Constable Over a year ago

"In the following example, str1 and str2 are equal": that seems obvious, since both strings are constructed from the same buffer. Are they equal in value, or are they the identical object? If you edit the buffer, do the strings change?

Peter Constable Over a year ago

Are the 2nd and 3rd code snippets you show from github.com/dotnet/runtime/blob/master/src/libraries/…?

Vadim Baratashvili Over a year ago

@PeterConstable 1. It's not entirely clear to me why we would use a MemoryStream rather than work directly with the byte array (for example, using LINQ methods), since we load the data into memory either way. Whether or not to hold the file open depends on the task. Sometimes we need to lock it so that no one else writes to it while we process the data. 2. They are equal in value; I was just demonstrating how we can navigate through the data.

Vadim Baratashvili Over a year ago

@PeterConstable 3. I went there directly from the code in Visual Studio by pressing F12 (I'm using ReSharper).
