In my application I need to open a file, look for a tag, and then perform an operation based on that tag. The catch is that the file content alternates every character with a \0 byte, so the text "CODE" becomes 0x43 0x00 0x4F 0x00 0x44 0x00 0x45 0x00 (expressed as hex bytes).

The issue is that the terminator is also a \0, so "CODE123" followed by the terminator looks like this:

0x43 0x00 0x4F 0x00 0x44 0x00 0x45 0x00 0x31 0x00 0x32 0x00 0x33 0x00 0x00 0x00

Since \0 is the null string terminator, File.ReadAllText() gives me only garbage, so I tried File.ReadAllBytes() and then removed every byte equal to 0. This gives me readable text, but I lose the information about where each piece of data ends: if the file contained CODE123[terminator]PROP456[terminator]blablabla, I end up with CODE123PROP456blablabla.

So I decided to get the file content as a byte[] and search it for another byte[] initialized with the CODE-with-\0-inside data. In theory this should work, but since the data array is fairly large (about 1.5 million elements), it takes far too long.

To top it off, I am looking for multiple occurrences of the CODE tag, so I can't just stop as soon as I find the first one.

I tried modifying the LINQ query posted as an answer to "Find the first occurrence/starting index of the sub-array in C#", as follows:

    var indices = (from i in Enumerable.Range(0, 1 + x.Length - y.Length)
                   where x.Skip(i).Take(y.Length).SequenceEqual(y)
                   select (int?)i).ToList();

but as soon as I try to enumerate the result, it just bogs down.

So, my question is: how can I efficiently find multiple subarrays in a large array? Thanks.

  • 1
    See my answer elsewhere which explains how to implement a Boyer-Moore search for binary data: stackoverflow.com/a/37500883/106159 Commented Nov 9, 2021 at 13:41
  • 1
    The nulls don't seem to be null string terminators. You just need to read the file with the correct encoding; they are simply part of the characters in that encoding. Presumably it is some kind of UTF-16, but you should know better than us what your file's encoding is. ReadAllText has an overload that takes the encoding. Commented Nov 9, 2021 at 13:42
  • @Ralf that's exactly the problem: they are not terminators except when they are used as one, so if I try to interpret them I get garbage (the first one is treated as a null string terminator and basically ruins the whole string interpretation), regardless of which encoding I try. Commented Nov 9, 2021 at 13:53
  • If you read with ReadAllText and Encoding.Unicode, you get a single string in which string terminators separate the individual substrings. Then Split will give you an array of the individual strings. Commented Nov 9, 2021 at 13:56
  • I'm not quite convinced ;) Have you tried encodings with changed byte order like BigEndianUnicode? Commented Nov 9, 2021 at 13:57
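The decode-and-split idea from the comments could look roughly like the sketch below. It assumes the whole file really is UTF-16 little-endian text (which the asker disputes, so treat it as something to try rather than a confirmed fix); the class and method names are made up for illustration. .NET strings are length-prefixed, so an embedded '\0' does not truncate them.

```csharp
using System;
using System.Text;

static class Utf16Fields
{
    // Hypothetical helper: decodes the bytes as UTF-16LE and splits on U+0000.
    // Empty entries (e.g. after a trailing terminator) are dropped.
    public static string[] SplitOnNul(byte[] data)
    {
        string text = Encoding.Unicode.GetString(data);
        return text.Split(new[] { '\0' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

It would be called as `Utf16Fields.SplitOnNul(File.ReadAllBytes(path))`. If the file mixes text with binary data, this will produce garbage for the binary parts, and a byte-level search is the safer route.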

1 Answer

The wonderful Boyer-Moore algorithm suggested by Matthew Wilson solved my problem amazingly.
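For readers who want something concrete, here is a minimal sketch of the idea. It is not the code from the linked answer: it uses the Horspool simplification of Boyer-Moore (bad-character table only), and the names `ByteSearch` and `FindAll` are made up for illustration. It returns every match position, so it covers the "multiple occurrences" requirement.

```csharp
using System;
using System.Collections.Generic;

static class ByteSearch
{
    // Returns every index at which pattern occurs in data (overlaps included).
    public static List<int> FindAll(byte[] data, byte[] pattern)
    {
        var hits = new List<int>();
        int m = pattern.Length;
        if (m == 0 || data.Length < m) return hits;

        // Bad-character table: shift distance keyed by the last byte of the window.
        var shift = new int[256];
        for (int b = 0; b < 256; b++) shift[b] = m;
        for (int i = 0; i < m - 1; i++) shift[pattern[i]] = m - 1 - i;

        int pos = 0;
        while (pos <= data.Length - m)
        {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) j--;
            if (j < 0) hits.Add(pos);

            // Skip ahead based on the byte aligned with the pattern's last position.
            pos += shift[data[pos + m - 1]];
        }
        return hits;
    }
}
```

The search pattern for the tag can be built with `System.Text.Encoding.Unicode.GetBytes("CODE")`, which yields the 0x43 0x00 0x4F 0x00 0x44 0x00 0x45 0x00 sequence from the question. Unlike the LINQ version, this makes a single pass over the array and allocates nothing per position.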

I then had to find a solution for locating the actual string terminators. That part looks too application-specific to be useful to anyone else, so I didn't post it. If you think it may be useful, let me know and I'll post it here :)
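As a generic illustration only (this is not the application-specific solution mentioned above), one way to read a string up to its terminator is to walk the bytes two at a time from a match offset until a 0x00 0x00 pair appears. The sketch assumes UTF-16LE data, a valid `start` offset that falls on a character boundary, and made-up names `Utf16Reader` and `ReadTerminated`.

```csharp
using System;
using System.Text;

static class Utf16Reader
{
    // Reads a UTF-16LE string beginning at byte offset start and ending at the
    // first 0x00 0x00 code unit, or at the end of the data if none is found.
    public static string ReadTerminated(byte[] data, int start)
    {
        int end = start;
        // Step one 16-bit code unit at a time so a lone 0x00 high byte
        // inside a character is never mistaken for the terminator.
        while (end + 1 < data.Length && !(data[end] == 0 && data[end + 1] == 0))
            end += 2;
        return Encoding.Unicode.GetString(data, start, end - start);
    }
}
```

Stepping by two bytes is the important design choice: checking every byte offset would wrongly pair the 0x00 high byte of one character with a following 0x00.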
