C# looking for subarrays in large byte array representing strings

Question

In my application I need to open a file, look for a tag, and then perform some operation based on that tag. The catch is that the file content alternates every char with a \0, so the text "CODE" becomes 0x43 0x00 0x4F 0x00 0x44 0x00 0x45 0x00 (expressed as hex bytes).

The issue is that the terminator is also a \0, so "CODE123" with the terminator would look something like this:

0x43 0x00 0x4F 0x00 0x44 0x00 0x45 0x00 0x31 0x00 0x32 0x00 0x33 0x00 0x00 0x00

Since \0 is the null string terminator, if I use File.ReadAllText() I get only garbage, so I tried using File.ReadAllBytes() and then purging each byte equal to 0. This gets me readable text, but then I lose the information about where each piece of data ends, i.e. if the file contained CODE123[terminator]PROP456[terminator]blablabla, I end up with CODE123PROP456blablabla.

So I decided to get the file content as a byte[] and then look for another byte[] initialized with the CODE-with-\0-inside data. In theory this should work, but since the data array is fairly large (about 1.5 million elements) it takes way too long.

The final cherry on the cake is that I am looking for multiple occurrences of the CODE tag, so I can't just stop as soon as I find the first one.

I tried modifying the LINQ query posted as an answer to "Find the first occurrence/starting index of the sub-array in C#", as follows:

    var indices = (from i in Enumerable.Range(0, 1 + x.Length - y.Length)
                   where x.Skip(i).Take(y.Length).SequenceEqual(y)
                   select (int?)i).ToList();

but as soon as I try to enumerate the result it just bogs down.

So, my question is: how can I EFFICIENTLY find multiple subarrays in a large array? Thanks.

See my answer elsewhere which explains how to implement a Boyer-Moore search for binary data: stackoverflow.com/a/37500883/106159
– Matthew Watson, Commented Nov 9, 2021 at 13:41
The nulls don't seem to be null string terminators. You just need to read the file with the correct encoding; they are simply part of the chars in that encoding. Presumably it is some kind of UTF-16, but you should know better than us what your file's encoding is. ReadAllText has an overload that takes the encoding.
– Ralf, Commented Nov 9, 2021 at 13:42
@Ralf that's exactly the problem: they are not terminators except when they are used as one, so if I try to interpret them I get garbage (the first one is treated as a null string terminator and basically ruins the whole string interpretation), regardless of what encoding I try.
– Marcomattia Mocellin, Commented Nov 9, 2021 at 13:53
If you read with ReadAllText and Encoding.Unicode, you get a single string with string terminators separating the individual substrings. Then Split will give you an array of the individual strings.
– Steve, Commented Nov 9, 2021 at 13:56
I'm not quite convinced ;) Have you tried encodings with a different byte order, like BigEndianUnicode?
– Ralf, Commented Nov 9, 2021 at 13:57
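
The decode-and-split idea from the comments above can be sketched as follows. This is only an illustration under assumptions the thread does not confirm: that the whole buffer is little-endian UTF-16 text (which is what Encoding.Unicode decodes) and that fields are separated by a single NUL character. The class and method names (Utf16Fields, SplitOnNul) are hypothetical. If the file mixes text with other binary data, this approach may well produce the garbage described in the question.

```csharp
using System;
using System.Text;

public static class Utf16Fields
{
    // Decodes the bytes as UTF-16LE (assumption) and splits the text on NUL characters.
    // .NET strings may contain '\0', so the decoded string keeps every separator.
    public static string[] SplitOnNul(byte[] data)
    {
        string text = Encoding.Unicode.GetString(data);
        return text.Split(new[] { '\0' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

The input would typically come from File.ReadAllBytes(path); each returned element is then one of the original terminated strings.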

Marcomattia Mocellin · Accepted Answer · 2021-11-11 15:53:51Z

The wonderful Boyer-Moore algorithm suggested by Matthew Watson solved my problem amazingly.

I then had to find a way to locate the actual string terminations; that part looks too application-specific to be useful to anybody else, so I didn't post it. If you think it may be useful, let me know and I'll post it here :)
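
For readers who want something concrete, here is a minimal sketch of a search in this family. It uses the Boyer-Moore-Horspool simplification (bad-character table only), not the code from the linked answer, and it collects every match instead of stopping at the first. The names ByteSearch and FindAll are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class ByteSearch
{
    // Returns the start index of every occurrence of pattern in data,
    // including overlapping ones.
    public static List<int> FindAll(byte[] data, byte[] pattern)
    {
        var hits = new List<int>();
        if (pattern.Length == 0 || pattern.Length > data.Length) return hits;

        int last = pattern.Length - 1;

        // Bad-character table: how far the window can move, keyed by the
        // data byte currently aligned with the last pattern position.
        var shift = new int[256];
        for (int b = 0; b < 256; b++) shift[b] = pattern.Length;
        for (int i = 0; i < last; i++) shift[pattern[i]] = last - i;

        int pos = 0;
        while (pos <= data.Length - pattern.Length)
        {
            // Compare the window against the pattern from right to left.
            int j = last;
            while (j >= 0 && data[pos + j] == pattern[j]) j--;
            if (j < 0) hits.Add(pos);

            // The shift is always at least 1, so the loop terminates.
            pos += shift[data[pos + last]];
        }
        return hits;
    }
}
```

With data built from Encoding.Unicode.GetBytes("CODE123\0PROP456\0CODE9") and the pattern from Encoding.Unicode.GetBytes("CODE"), this returns the offsets 0 and 32. Note that in UTF-16 text every other byte is 0x00, which limits how far the table lets the window jump, so the speedup over a plain loop may be smaller than on arbitrary binary data.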
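
Since that code was not posted, the following is only a guess at one generic way to do it, not the approach actually used here. It assumes the text is little-endian UTF-16 and that a string ends at the first 0x00 0x00 pair found on a two-byte boundary counted from the start offset. The names Utf16Reader and ReadTerminated are hypothetical.

```csharp
using System;
using System.Text;

public static class Utf16Reader
{
    // Reads a UTF-16LE string beginning at 'start' and ending just before the
    // first 0x00 0x00 code unit, or at the end of the data if none is found.
    public static string ReadTerminated(byte[] data, int start)
    {
        if (start < 0 || start > data.Length)
            throw new ArgumentOutOfRangeException(nameof(start));

        int end = start;
        // Step two bytes at a time so that the 0x00 high byte of an ASCII
        // character is never paired with the first byte of the terminator.
        while (end + 1 < data.Length && !(data[end] == 0 && data[end + 1] == 0))
            end += 2;

        return Encoding.Unicode.GetString(data, start, end - start);
    }
}
```

Calling ReadTerminated(data, offset) with an offset returned by the tag search would give the tag together with whatever follows it up to the terminator, for example "CODE123" for the bytes shown in the question.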
