7

I have this following code:

var file = //Memory stream with a file in it
var bytes = file.ToArray();

I need to search the bytes for the first occurrence (if any) of the specified byte sequence: 0xff, 0xd8. (The purpose of this is to find images embedded in files)

So if for example bytes[6501] contains 0xff and bytes[6502] contains 0xd8, thats a match and I need either the index of the position returned (6501), or a new array, which is a copy of the bytes array, except it doesn't have the keys below 6501 from the old array.

My current solution is to loop:

 for (var index = 0; index < bytes.Length; index++)
 {
     if((new byte[] {0xff, 0xd8}).SequenceEqual(bytes.Skip(index).Take(2))
    ...

But it's pretty slow when it's handling bigger files.

Is there some more efficient way to handle this?

7
  • One very minor thing - why create a new byte[] in each iteration of the loop? Commented Aug 20, 2014 at 9:06
  • 1
    I'm not, I actually create it before the loop and in the loop itself I only use a variable to refer to it, i just didn't wanted the code example to be too complicated. Commented Aug 20, 2014 at 9:08
  • You could try implementing a Boyer-Moore search algorithm. Here is a C# implementation for strings which you could use as a guide. Commented Aug 20, 2014 at 9:14
  • How much is your RAM used up when processing big files? Have you considered only processing chunks of limited size? Commented Aug 20, 2014 at 9:14
  • @Dman: Not much, those files are only couple of megabytes big (usually 2 - 10MB) so the RAM doesn't get eaten up much. Commented Aug 20, 2014 at 9:18

5 Answers 5

13

If this is time-critical code, I found the C# compiler (both Mono's implementation and Microsoft's) to have special logic to optimize simple scan loops.

So from profiling experience, I'd implement a sequence search with a hardcoded first-element search like this:

/// <summary>Looks for the next occurrence of a sequence in a byte array</summary>
/// <param name="array">Array that will be scanned</param>
/// <param name="start">Index in the array at which scanning will begin</param>
/// <param name="sequence">Sequence the array will be scanned for</param>
/// <returns>
///   The index of the next occurrence of the sequence of -1 if not found
/// </returns>
private static int findSequence(byte[] array, int start, byte[] sequence) {
  int end = array.Length - sequence.Length; // past here no match is possible
  byte firstByte = sequence[0]; // cached to tell compiler there's no aliasing

  while(start <= end) {
    // scan for first byte only. compiler-friendly.
    if(array[start] == firstByte) {
      // scan for rest of sequence
      for (int offset = 1;; ++offset) {
        if(offset == sequence.Length) { // full sequence matched?
          return start;
        } else if(array[start + offset] != sequence[offset]) {
          break;
        }
      }
    }
    ++start;
  }

  // end of array reached without match
  return -1;
}

Quite a bit longer than other suggestions and prone to off-by-1 errors, but if you're scanning a huge chunk of data or doing this for frequent device IO, this setup will avoid feeding the garbage collector and optimize very well.

EDIT 2019-10-03: Fixed issues pointed out by Warren Rox. Thanks! Tests: https://ideone.com/mmACYj

Sign up to request clarification or add additional context in comments.

2 Comments

Warning - This algorithm will provide a false negative result when array and sequence are of length 1. Additionally, a false positive is produced when using an offset and trying to match on the last byte in the sequence array.
Thanks for letting me know. I've updated the code and it should hopefully handle both cases correctly now!
2

You want to be using a for loop to check your array. The reason why your code is slow is rather simple.

Decompilation shows why:

public static IEnumerable<TSource> Skip<TSource>(this IEnumerable<TSource> source, int count)
{
  if (source == null)
    throw Error.ArgumentNull("source");
  else
    return Enumerable.SkipIterator<TSource>(source, count);
}

private static IEnumerable<TSource> SkipIterator<TSource>(IEnumerable<TSource> source, int count)
{
  using (IEnumerator<TSource> enumerator = source.GetEnumerator())
  {
    while (count > 0 && enumerator.MoveNext())
      --count;
    if (count <= 0)
    {
      while (enumerator.MoveNext())
        yield return enumerator.Current;
    }
  }
}

For each for you're looping you're performing a skip, basically unnecessairily iterating over your array again.

SOME Linq operations contain optimizations to use indexers when possible - skip is not one of them unfortunately.

PS:

If i was you i'd change your code to something like

var search = new byte[] {0xff, 0xd8};
var current = new byte[2];
var maxSearchRange = bytes.Length -1;
for (var index = 0; index < maxSearchRange; index++)
{
   current[0] = bytes[index];
   current[1] = bytes[index+1];

   if((search).SequenceEqual(current))
       ...

Comments

2

Is there a downside to a simple linear search?
Returns start index if found, else -1

private const byte First = 0x0ff;
private const byte Second = 0x0d8;

private static int FindImageStart(IList<byte> bytes) {
    for (var index = 0; index < bytes.Count - 1; index++) {
        if (bytes[index] == First && bytes[index + 1] == Second) {
            return index;
        }
    }
    return -1;
}

1 Comment

The upper bound of the for loop should be bytes.Count -1
2
public int FindSequence(byte[] source, byte[] seq)
{
    var start = -1;
    for (var i = 0; i < source.Length - seq.Length + 1 && start == -1; i++)
    {
        var j = 0;
        for (; j < seq.Length && source[i+j] == seq[j]; j++) {}
        if (j == seq.Length) start = i;
    }
    return start;
}

Comments

1

How about simple..?

bytes[] pattern = new bytes[] { 1, 2, 3, 4, 5 };
for (var index = 0, end = bytes.Length - pattern.length; index < end; index++)
{
    bool found = false;
    for(int j = 0; j < pattern.Length && !found; j++)
    {
        found = bytes[index + j] == pattern[j];
    }
    if(found)
        return index;
}

Please note i did not code in c# for a looong time so excuse me syntax errors if there are any. Regard this as pseudo-code (that no longer throws an index error) :)

5 Comments

IndexOutOfRangeException :)
You may need index < bytes.Length -1 to avoid IndexOutOfRangeException
That won't work for me, i sometimes need to check for long sequences (10 bytes or more) and for multiple sequences too, so my ifs would stretched all the way to the moon. (Not to mention I already tried it and it made no noticeable speed difference.)
Updated to fix IndexOutOfRangeException and support arbitrary length patterns. Please adapt it to your needs.
inner for, && !found : at the first match, it exits. is it right ?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.