I have to create a utility that searches through 40 to 60 GiB of text files as quickly as possible.
Each file holds around 50 MB of data consisting of log lines (about 630,000 lines per file).
A NoSQL document database is unfortunately not an option...

As of now I am using an Aho-Corasick algorithm for the search, which I borrowed from Tomas Petricek's blog. It works very well.

I process the files in Tasks. Each file is loaded into memory by simply calling File.ReadAllLines(path). The lines are then fed into the Aho-Corasick search one by one, so each file causes around 600,000 calls to the algorithm (I need the line number in my results).

This takes a lot of time and requires a lot of memory and CPU.
I have very little expertise in this field as I usually work in image processing.
Can you guys recommend algorithms and approaches which could speed up the processing?

Below is a more detailed view of the Task creation and file loading, which is fairly standard. For more information on the Aho-Corasick implementation, please see the linked blog post above.

private KeyValuePair<string, StringSearchResult[]> FindInternal(
    IStringSearchAlgorithm algo, 
    string file)
{
    List<StringSearchResult> result = new List<StringSearchResult>();
    string[] lines = File.ReadAllLines(file);
    // Search each line once, record its line number, and collect the hits.
    for (int i = 0; i < lines.Length; i++)
    {
        var results = algo.FindAll(lines[i]);
        for (int j = 0; j < results.Length; j++)
        {
            results[j].Row = i;
        }
        result.AddRange(results);
    }
    return new KeyValuePair<string, StringSearchResult[]>(
        file, result.ToArray());
}


public Dictionary<string, StringSearchResult[]> Find(
    params string[] search)
{
    IStringSearchAlgorithm algo = new StringSearch();
    algo.Keywords = search;
    Task<KeyValuePair<string, StringSearchResult[]>>[] findTasks
        = new Task<KeyValuePair<string, StringSearchResult[]>>[_files.Count];
    Parallel.For(0, _files.Count, i => {
        findTasks[i] = Task.Factory.StartNew(
            () => FindInternal(algo, _files[i])
        );
    });
    Task.WaitAll(findTasks);
    return findTasks.Select(t => t.Result)
        .ToDictionary(x => x.Key, x => x.Value);
}
  • I don't think you want to feed the text into the algorithm line by line; I think that may undermine the search algorithm. Commented Nov 23, 2022 at 17:54
  • Why don't you take Tomas's algorithm and just test it as a single call against a single file? PS: I know nothing about this algorithm. Commented Nov 23, 2022 at 17:55
  • I'd also throw away all the parallel stuff until you get it to work. Running stuff in parallel MAY make it run N times quicker (N is probably < 10), but it pays to optimize the algorithm first and then throw parallelism at it if it doesn't undermine the algorithm. Commented Nov 23, 2022 at 17:58
  • Tomas also points out that creating the index is slow... but lookups are fast. Commented Nov 23, 2022 at 17:59
  • @MrDatKookerellaLtd Thanks for your input. For now I ditched the whole parallelism and keep it linear. I also ditched the Aho-Corasick as it was too slow and I still needed pattern matching as well, so I switched to Regex instead. Commented Nov 24, 2022 at 15:30

2 Answers


EDIT
See section Initial Answer for the original Answer.

I further optimized my code by doing the following:

  • Added paging to prevent memory overflow / crashes due to the large amount of result data.
  • I offload the search results into local files as soon as they exceed a certain buffer size (64 KB in my case).
  • Offloading the results required me to convert my SearchData struct to binary and back (see the sketch after this list).
  • Splitting the array of files to be processed and running the parts in Tasks greatly increased performance (from 35 sec to 9 sec when processing about 25 GiB of search data).
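
For the binary conversion, a simplified sketch of what the offload could look like using BinaryWriter and BinaryReader (the field names here are illustrative, not necessarily the exact ones from my struct):

private static void WriteSearchData(BinaryWriter writer, SearchData sd)
{
    // Write each field in a fixed order.
    writer.Write(sd.Match);
    writer.Write(sd.Index);
    writer.Write(sd.Line);
}

private static SearchData ReadSearchData(BinaryReader reader)
{
    // Read the fields back in the same order.
    SearchData sd = new SearchData();
    sd.Match = reader.ReadString();
    sd.Index = reader.ReadInt32();
    sd.Line  = reader.ReadString();
    return sd;
}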

Splitting / scaling the file array
The code below computes a scaled/normalized value from T_min and T_max.
This value is then used to determine the size of each slice, i.e. how many file paths each Task gets.

private int ScalePartition(int T_min, int T_max)
{
    // Scale m to range.
    int m = T_max / 2;
    int t_min = 4;
    int t_max = Math.Max(T_max / 16, T_min);            
    m = ((T_min - m) / (T_max - T_min)) * (t_max - t_min) + t_max;

    return m;
}

This code shows how the scaling and splitting is applied.

// Get size of file array portion.
int scale = ScalePartition(1, _files.Count);
// Iterator.
int n = 0;
// List containing tasks.
List<Task<SearchData[]>> searchTasks = new List<Task<SearchData[]>>();
// Loop through files.
while (n < _files.Count) {
    // Local copy of n.
    // The task lambda must capture a copy, because n keeps
    // changing while the tasks are being created.
    int num = n;
    // The number of items to take.
    // The last slice may contain fewer than 'scale' files.
    int cnt = n + scale > _files.Count ? _files.Count - n : scale;
    // Run the Find(int, int, Regex[]) method and add as task.
    searchTasks.Add(Task.Run(() => Find(num, cnt, regexes)));
    // Increment iterator by the amount of files stored in scale.
    n += scale;
}
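
The snippet above only starts the Tasks; collecting the results afterwards could then look roughly like this (a sketch, assuming Find(int, int, Regex[]) returns the SearchData hits for its slice):

// Wait for all slices to finish and flatten the per-slice results.
Task.WaitAll(searchTasks.ToArray());
SearchData[] allResults = searchTasks
    .SelectMany(t => t.Result)
    .ToArray();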

Initial Answer

I had the best results so far after switching to MemoryMappedFile and moving from Aho-Corasick back to Regex (pattern matching turned out to be a hard requirement).

There are still parts that can be optimized or changed and I'm sure this is not the fastest or best solution, but for now it's alright.

Here is the code which returns the results in 30 seconds for 25 GiB worth of data:

// GNU coreutil wc defined buffer size.
// Had best performance with this buffer size.
//
// Definition in wc.c:
// -------------------
// /* Size of atomic reads. */
// #define BUFFER_SIZE (16 * 1024)
//
private const int BUFFER_SIZE = 16 * 1024;

private KeyValuePair<string, SearchData[]> FindInternal(Regex[] rgx, string file)
{
    // Buffer for data segmentation.
    byte[] buffer = new byte[BUFFER_SIZE];
    // Get size of file.
    FileInfo fInfo = new FileInfo(file);
    long fSize = fInfo.Length;
    fInfo = null;

    // List of results.
    List<SearchData> results = new List<SearchData>();

    // Create MemoryMappedFile.
    string name = "mmf_" + Path.GetFileNameWithoutExtension(file);
    using (var mmf = MemoryMappedFile.CreateFromFile(
        file, FileMode.Open, name))
    {
        // Create read-only in-memory access to file data.
        using (var accessor = mmf.CreateViewStream(
            0, fSize,
            MemoryMappedFileAccess.Read))
        {
            // Store the file offset at which the current chunk starts.
            int pos = (int)accessor.Position;
            // Initial chunk size: a full buffer, or the whole file
            // if it is smaller than the buffer.
            int cnt = (int)(fSize > BUFFER_SIZE
                    ? BUFFER_SIZE
                    : fSize);

            // Iterate through the file until the end is reached.
            while (accessor.Position < fSize)
            {
                // Remember where this chunk starts in the file.
                pos = (int)accessor.Position;
                // Read the next chunk into the buffer.
                accessor.Read(buffer, 0, cnt);
                // Convert only the bytes that were read to a string
                // for the Regex search.
                string s = Encoding.UTF8.GetString(buffer, 0, cnt);
                // Size of the next chunk (the last chunk may be shorter).
                cnt = (int)(fSize - accessor.Position >= BUFFER_SIZE
                    ? BUFFER_SIZE
                    : fSize - accessor.Position);
                // Run regex against extracted data.
                foreach (Regex r in rgx) {
                    // Get matches.
                    MatchCollection matches = r.Matches(s);
                    // Create SearchData struct to reduce memory 
                    // impact and only keep relevant data.
                    foreach (Match m in matches) {
                        SearchData sd = new SearchData();
                        // The actual matched string.
                        sd.Match = m.Value; 
                        // The index in the file.
                        sd.Index = m.Index + pos;
                        // Index to find beginning of line.
                        int nFirst = m.Index;
                        // Index to find end of line.
                        int nLast = m.Index;
                        // Walk backwards until the '\n' that ends the
                        // preceding line is found.
                        while (s[nFirst] != '\n' && nFirst > 0) {
                            nFirst--;
                        }
                        // Skip the length of \r\n (new line).
                        // Change this to 1 if you work on a Unix system.
                        nFirst+=2;
                        // Walk forward until the end of the
                        // current line is found.
                        while (s[nLast] != '\n' && nLast < s.Length-1)  {
                            nLast++;
                        }
                        // Remove the length of \r\n (new line).
                        // Change this to 1 if you work on a Unix system.
                        nLast-=2;
                        // Store whole line in SearchData struct.
                        sd.Line = s.Substring(nFirst, nLast - nFirst);
                        // Add result.
                        results.Add(sd);
                    }
                }
            }
        }
    }
    return new KeyValuePair<string, SearchData[]>(file, results.ToArray());
}


public List<KeyValuePair<string, SearchData[]>> Find(params string[] search)
{
    var results = new List<KeyValuePair<string, SearchData[]>>();
    // Prepare regex objects.
    Regex[] regexes = new Regex[search.Length];
    for (int i=0; i<regexes.Length; i++) {
        regexes[i] = new Regex(search[i], RegexOptions.Compiled);                
    }

    // Get all search results.
    // Creating the Regex objects once and passing them to the
    // sub-routine is best, as constructing a compiled Regex
    // adds a lot of overhead.
    foreach (var file in _files) {
        var data = FindInternal(regexes, file);                
        results.Add(data);
    }
    return results;
}
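
The per-file loop above runs sequentially; a minimal sketch of a parallel variant (just the idea, not exactly what I ended up with) could use Parallel.ForEach, since compiled Regex instances are safe to share between threads for matching:

// Hypothetical parallel variant of Find (sketch only, no error handling).
public List<KeyValuePair<string, SearchData[]>> FindParallel(params string[] search)
{
    // Create the compiled Regex objects once, as in Find above.
    Regex[] regexes = search
        .Select(s => new Regex(s, RegexOptions.Compiled))
        .ToArray();

    // Each file is searched on its own thread-pool thread.
    var results = new ConcurrentBag<KeyValuePair<string, SearchData[]>>();
    Parallel.ForEach(_files, file =>
    {
        results.Add(FindInternal(regexes, file));
    });
    return results.ToList();
}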

I had a stupid idea yesterday where I thought it might work to convert the file data to a bitmap and search for the input within the pixels, as pixel checking is quite fast.

Just for the giggles... here is the non-optimized test code for that stupid idea:

public struct SearchData
{
    public string Line;
    public string Search;
    public int Row;

    public SearchData(string l, string s, int r) {
        Line    = l;
        Search  = s;
        Row     = r;
    }
}


internal static class FileToImage
{
    public static unsafe SearchData[] FindText(string search, Bitmap bmp)
    {
        byte[] buffer = Encoding.ASCII.GetBytes(search);

        BitmapData data = bmp.LockBits(
            new Rectangle(0, 0, bmp.Width, bmp.Height),
            ImageLockMode.ReadOnly, bmp.PixelFormat);

        List<SearchData> results = new List<SearchData>();
        int bpp = Bitmap.GetPixelFormatSize(bmp.PixelFormat) / 8;
        byte* ptFirst = (byte*)data.Scan0;
        byte firstHit = buffer[0];
        bool isFound = false;
        for (int y=0; y<data.Height; y++) {
            byte* ptStride = ptFirst + (y * data.Stride);
            for (int x=0; x<data.Stride; x++) {
                if (firstHit == ptStride[x]) {
                    byte[] temp = new byte[buffer.Length];                       
                    if (buffer.Length < data.Stride-x) {
                        int ret = 0;                            
                        for (int n=0, xx=x; n<buffer.Length; n++, xx++) {                             
                            if (ptStride[xx] != buffer[n]) {
                                break;
                            }
                            ret++;
                        }
                        if (ret == buffer.Length) {

                            int lineLength = 0;
                            for (int n = 0; n<data.Stride; n+=bpp) {
                                if (ptStride[n+2] == 255 &&
                                    ptStride[n+1] == 255 &&
                                    ptStride[n+0] == 255) 
                                {
                                    lineLength=n;
                                }
                            }

                            SearchData sd = new SearchData();
                            byte[] lineBytes = new byte[lineLength];
                            Marshal.Copy((IntPtr)ptStride, lineBytes, 0, lineLength);
                            sd.Search = search;
                            sd.Line = Encoding.ASCII.GetString(lineBytes);
                            sd.Row = y;
                            results.Add(sd);
                        }
                    }
                }
            }             
        }
        bmp.UnlockBits(data);
        return results.ToArray();
    }
    

    private static unsafe Bitmap GetBitmapInternal(string[] lines, int startIndex, Bitmap bmp)
    {
        int bpp = Bitmap.GetPixelFormatSize(bmp.PixelFormat) / 8;
        BitmapData data = bmp.LockBits(
            new Rectangle(0, 0, bmp.Width, bmp.Height),
            ImageLockMode.ReadWrite,
            bmp.PixelFormat);

        int index = startIndex;
        byte* ptFirst = (byte*)data.Scan0;
        int maxHeight = bmp.Height;
        if (lines.Length - startIndex < maxHeight) {
            maxHeight = lines.Length - startIndex -1;
        }
        for (int y = 0; y < maxHeight; y++) {
            byte* ptStride = ptFirst + (y * data.Stride);
            index++;
            int max = lines[index].Length;
            max += (max % bpp);
            lines[index] += new string('\0', max % bpp);
            max = lines[index].Length;
            for (int x=0; x+2<max; x+=bpp) {                    
                ptStride[x+0] = (byte)lines[index][x+0];
                ptStride[x+1] = (byte)lines[index][x+1];
                ptStride[x+2] = (byte)lines[index][x+2];
            }
            ptStride[max+2] = 255;
            ptStride[max+1] = 255;
            ptStride[max+0] = 255;
            for (int x = max + bpp; x < data.Stride; x += bpp) {
                ptStride[x+2] = 0;
                ptStride[x+1] = 0;
                ptStride[x+0] = 0;
            }
        }
        bmp.UnlockBits(data);            
        return bmp;
    }


    public static unsafe Bitmap[] GetBitmap(string filePath)
    {
        int bpp = Bitmap.GetPixelFormatSize(PixelFormat.Format24bppRgb) / 8;
        var lines = System.IO.File.ReadAllLines(filePath);
        int y = 0x800; //lines.Length / 0x800;
        int x = lines.Max(l => l.Length) / bpp;
        int cnt = (int)Math.Ceiling((float)lines.Length / (float)y);

        Bitmap[] results = new Bitmap[cnt];
        for (int i = 0; i < results.Length; i++) {
            results[i] = new Bitmap(x, y, PixelFormat.Format24bppRgb);
            results[i] = GetBitmapInternal(lines, i * 0x800, results[i]);
        }
        return results;            
    }

}



You can split the file into partitions and regex search each partition in parallel, then join the results. There are some sharp edges in the details, like handling values that span two partitions. Gigantor is a C# library I have created that does this very thing. Feel free to try it or have a look at the source code.
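
A rough sketch of the partition idea (this is not Gigantor's actual API, just the concept, shown on an in-memory string for brevity and assuming no single match is longer than the overlap):

// Search each partition plus a small overlap in parallel; a match is only
// reported by the partition in which it starts, so the overlap region is
// not counted twice.
static List<(int Position, string Value)> SearchPartitioned(
    string text, Regex regex, int partitions, int overlap)
{
    var results = new ConcurrentBag<(int, string)>();
    int size = (text.Length + partitions - 1) / partitions;
    Parallel.For(0, partitions, i =>
    {
        int start = i * size;
        if (start >= text.Length) return;
        int length = Math.Min(size + overlap, text.Length - start);
        foreach (Match m in regex.Matches(text.Substring(start, length)))
        {
            // Keep the match only if it begins in this partition's own
            // range, and record its absolute position in the text.
            if (m.Index < size)
                results.Add((start + m.Index, m.Value));
        }
    });
    return results.ToList();
}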

4 Comments

Thank you, I actually ended up implementing that in the end. I split each file (if it exceeds a certain buffer size). Those portions live inside MemoryMappedFiles whose contents are then processed by the Regex engine. So I ended up processing the files in n Tasks, and depending on the file size the Task handling a file splits it into multiple MemoryMappedFiles which are processed in parallel.
Nice, glad you got it working. I'm curious, how do you handle matches that cross chunk boundaries? Or maybe it doesn't matter if you occasionally miss a match?
After the file is completed I check whether chunks n and n+1 have matches. If so, I point a new MemoryMappedFile at the region between those chunks: I take my default buffer size and load half of it from the end of chunk n and half from the beginning of chunk n+1, which gives me a chunk covering the boundary region that might contain additional matches. I also validate the position of the matches so I don't get duplicates. I track the length, position, chunk and the necessary metadata in a struct.
Though this does not work nicely with patterns that match data potentially larger than the buffer size itself... I designed this to work predominantly with log files, so in my use case the patterns usually do not match more than a few lines and I didn't bother implementing anything smarter.
