Most efficient way to find pattern in byte array

Question

I have this following code:

var file = //Memory stream with a file in it
var bytes = file.ToArray();

I need to search the bytes for the first occurrence (if any) of the specified byte sequence: 0xff, 0xd8. (The purpose of this is to find images embedded in files)

So if for example bytes[6501] contains 0xff and bytes[6502] contains 0xd8, thats a match and I need either the index of the position returned (6501), or a new array, which is a copy of the bytes array, except it doesn't have the keys below 6501 from the old array.

My current solution is to loop:

 for (var index = 0; index < bytes.Length; index++)
 {
     if((new byte[] {0xff, 0xd8}).SequenceEqual(bytes.Skip(index).Take(2))
    ...

But it's pretty slow when it's handling bigger files.

Is there some more efficient way to handle this?

One very minor thing - why create a new byte[] in each iteration of the loop? — Daniel Kelley
– Daniel Kelley, Commented Aug 20, 2014 at 9:06
I'm not, I actually create it before the loop and in the loop itself I only use a variable to refer to it, i just didn't wanted the code example to be too complicated. — Fred
– Fred, Commented Aug 20, 2014 at 9:08
You could try implementing a Boyer-Moore search algorithm. Here is a C# implementation for strings which you could use as a guide. — PaddyD
– PaddyD, Commented Aug 20, 2014 at 9:14
How much is your RAM used up when processing big files? Have you considered only processing chunks of limited size? — DMAN
– DMAN, Commented Aug 20, 2014 at 9:14
@Dman: Not much, those files are only couple of megabytes big (usually 2 - 10MB) so the RAM doesn't get eaten up much. — Fred
– Fred, Commented Aug 20, 2014 at 9:18

Cygon · Accepted Answer · 2019-10-03 19:15:56Z

13

If this is time-critical code, I found the C# compiler (both Mono's implementation and Microsoft's) to have special logic to optimize simple scan loops.

So from profiling experience, I'd implement a sequence search with a hardcoded first-element search like this:

/// <summary>Looks for the next occurrence of a sequence in a byte array</summary>
/// <param name="array">Array that will be scanned</param>
/// <param name="start">Index in the array at which scanning will begin</param>
/// <param name="sequence">Sequence the array will be scanned for</param>
/// <returns>
///   The index of the next occurrence of the sequence of -1 if not found
/// </returns>
private static int findSequence(byte[] array, int start, byte[] sequence) {
  int end = array.Length - sequence.Length; // past here no match is possible
  byte firstByte = sequence[0]; // cached to tell compiler there's no aliasing

  while(start <= end) {
    // scan for first byte only. compiler-friendly.
    if(array[start] == firstByte) {
      // scan for rest of sequence
      for (int offset = 1;; ++offset) {
        if(offset == sequence.Length) { // full sequence matched?
          return start;
        } else if(array[start + offset] != sequence[offset]) {
          break;
        }
      }
    }
    ++start;
  }

  // end of array reached without match
  return -1;
}

Quite a bit longer than other suggestions and prone to off-by-1 errors, but if you're scanning a huge chunk of data or doing this for frequent device IO, this setup will avoid feeding the garbage collector and optimize very well.

EDIT 2019-10-03: Fixed issues pointed out by Warren Rox. Thanks! Tests: https://ideone.com/mmACYj

edited Oct 3, 2019 at 19:15

answered Aug 18, 2016 at 14:49

Cygon

9,6609 gold badges46 silver badges50 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Warren Rox Over a year ago

Warning - This algorithm will provide a false negative result when array and sequence are of length 1. Additionally, a false positive is produced when using an offset and trying to match on the last byte in the sequence array.

Cygon Over a year ago

Thanks for letting me know. I've updated the code and it should hopefully handle both cases correctly now!

Dbl · Accepted Answer · 2014-08-20 09:42:38Z

You want to be using a for loop to check your array. The reason why your code is slow is rather simple.

Decompilation shows why:

public static IEnumerable<TSource> Skip<TSource>(this IEnumerable<TSource> source, int count)
{
  if (source == null)
    throw Error.ArgumentNull("source");
  else
    return Enumerable.SkipIterator<TSource>(source, count);
}

private static IEnumerable<TSource> SkipIterator<TSource>(IEnumerable<TSource> source, int count)
{
  using (IEnumerator<TSource> enumerator = source.GetEnumerator())
  {
    while (count > 0 && enumerator.MoveNext())
      --count;
    if (count <= 0)
    {
      while (enumerator.MoveNext())
        yield return enumerator.Current;
    }
  }
}

For each for you're looping you're performing a skip, basically unnecessairily iterating over your array again.

SOME Linq operations contain optimizations to use indexers when possible - skip is not one of them unfortunately.

PS:

If i was you i'd change your code to something like

var search = new byte[] {0xff, 0xd8};
var current = new byte[2];
var maxSearchRange = bytes.Length -1;
for (var index = 0; index < maxSearchRange; index++)
{
   current[0] = bytes[index];
   current[1] = bytes[index+1];

   if((search).SequenceEqual(current))
       ...

AlanT · Accepted Answer · 2016-04-19 17:52:34Z

2

Is there a downside to a simple linear search?
Returns start index if found, else -1

private const byte First = 0x0ff;
private const byte Second = 0x0d8;

private static int FindImageStart(IList<byte> bytes) {
    for (var index = 0; index < bytes.Count - 1; index++) {
        if (bytes[index] == First && bytes[index + 1] == Second) {
            return index;
        }
    }
    return -1;
}

edited Apr 19, 2016 at 17:52

answered Aug 20, 2014 at 9:59

AlanT

3,66122 silver badges28 bronze badges

1 Comment

Dusty Over a year ago

The upper bound of the for loop should be bytes.Count -1

OfirD · Accepted Answer · 2020-05-03 13:28:30Z

2

public int FindSequence(byte[] source, byte[] seq)
{
    var start = -1;
    for (var i = 0; i < source.Length - seq.Length + 1 && start == -1; i++)
    {
        var j = 0;
        for (; j < seq.Length && source[i+j] == seq[j]; j++) {}
        if (j == seq.Length) start = i;
    }
    return start;
}

answered May 3, 2020 at 13:28

OfirD

10.7k8 gold badges59 silver badges105 bronze badges

Comments

novist · Accepted Answer · 2014-08-20 09:17:56Z

1

How about simple..?

bytes[] pattern = new bytes[] { 1, 2, 3, 4, 5 };
for (var index = 0, end = bytes.Length - pattern.length; index < end; index++)
{
    bool found = false;
    for(int j = 0; j < pattern.Length && !found; j++)
    {
        found = bytes[index + j] == pattern[j];
    }
    if(found)
        return index;
}

Please note i did not code in c# for a looong time so excuse me syntax errors if there are any. Regard this as pseudo-code (that no longer throws an index error) :)

edited Aug 20, 2014 at 9:17

answered Aug 20, 2014 at 9:00

novist

711 silver badge7 bronze badges

5 Comments

L.B Over a year ago

IndexOutOfRangeException :)

Sriram Sakthivel Over a year ago

You may need index < bytes.Length -1 to avoid IndexOutOfRangeException

Fred Over a year ago

That won't work for me, i sometimes need to check for long sequences (10 bytes or more) and for multiple sequences too, so my ifs would stretched all the way to the moon. (Not to mention I already tried it and it made no noticeable speed difference.)

novist Over a year ago

Updated to fix IndexOutOfRangeException and support arbitrary length patterns. Please adapt it to your needs.

Massimo Over a year ago

inner for, && !found : at the first match, it exits. is it right ?

Collectives™ on Stack Overflow

Most efficient way to find pattern in byte array

5 Answers 5

2 Comments

Comments

1 Comment

Comments

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

2 Comments

Comments

1 Comment

Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related