6

I have a text file containing 21,000 strings (one per line) and 500 MB of other text files (mainly source code). For each string I need to determine whether it is contained in any of those files. I wrote a program that does the job, but its performance is terrible: it would take a couple of days, and I need the job done in 5-6 hours at most.
I'm using C# with Visual Studio 2010.

I have a couple of questions regarding my problem:
a) Which approach is better?

foreach(string s in StringsToSearch)
{
    //scan all files and break when string is found
}

or

foreach(string f in Files)
{
    //search that file for each string that is not already found
}

b) Is it better to scan one file line by line

StreamReader r = new StreamReader(file);
while(!r.EndOfStream)
{
    string s = r.ReadLine();
    //... if(s.Contains(xxx));
}

or

StreamReader r = new StreamReader(file);
string s = r.ReadToEnd();
//if(s.Contains(xxx));

c) Would threading improve performance, and how would I go about it?
d) Is there any existing software that can do this, so I don't have to write my own code?

4
  • Do you have to write the program? Windows has findstr built in. You could use a for loop to search the other files in parallel. Commented Oct 21, 2010 at 12:10
  • Definitely not a correct/complete answer, but don't load all (500 MB!) of the files for each string. Once you have (part of) a file in memory, do all of your work on it then. Commented Oct 21, 2010 at 12:11
  • I meant to load the whole files one by one, not 500 MB of files at once. Commented Oct 21, 2010 at 12:14
  • What OS are you on? If you are on Windows 7, you can search within files out of the box, and it's pretty fast. Commented Oct 21, 2010 at 12:23

5 Answers

6

If you just want to know whether each string is found or not, and don't need to do any further processing, then I'd suggest you just use grep. Grep is extremely fast and designed for exactly this kind of problem.

grep -f strings-file other-files...

should do the trick. I'm sure there is a Windows implementation out there somewhere. At worst, Cygwin will have it.
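
Since the 21,000 entries are literal strings rather than regular expressions, it is also worth adding -F (treat each pattern as a fixed string, bypassing the regex engine) and -l (print only the names of files that contain a match):

grep -F -l -f strings-file other-files...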

EDIT: This answers question d)


3 Comments

Yes, despite the [C#] tag this might be the best approach.
I'm not familiar with grep and how it works, so could you give a little help on how to use it?
Grep is a very common tool on *nix systems. There's a lot of documentation out there, so there's bound to be a good tutorial somewhere. The suggested command looks for every string in "strings-file" across all of the "other-files" and prints the matching lines from "other-files". There are many options for changing the output to what you need.
4

You want to minimize file I/O, so your first idea is very bad: it would open each of the 'other' files up to 21,000 times. You want something based on the second approach from a). And since those other files aren't overly big, load each one into memory once with ReadAllText.

List<string> keys = ...;    // load all 21,000 strings once

foreach (string f in Files)
{
    // search this file for each string that is not already found
    string text = System.IO.File.ReadAllText(f);  // convenient version of ReadToEnd

    // brute force
    foreach (string key in keys)
    {
        if (text.IndexOf(key) >= 0)
        {
            // record that this key was found
        }
    }
}

The brute-force part can be improved upon, but I think you will find it acceptable; one possible refinement is sketched below.
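
For instance, one way to act on "each string that is not already found" is to keep a set of keys already located, skip them in later files, and stop as soon as everything has been seen. A rough sketch (the file and directory names are placeholders; StringComparison.Ordinal avoids the slower culture-aware search):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class SearchSketch
    {
        static void Main()
        {
            // "strings.txt" and C:\sources are made-up names - substitute your own
            var keys  = new HashSet<string>(File.ReadAllLines("strings.txt"));
            var found = new HashSet<string>();

            foreach (string f in Directory.EnumerateFiles(@"C:\sources", "*", SearchOption.AllDirectories))
            {
                string text = File.ReadAllText(f);

                foreach (string key in keys)
                {
                    // ordinal = plain character comparison, faster than the culture-aware default
                    if (!found.Contains(key) && text.IndexOf(key, StringComparison.Ordinal) >= 0)
                        found.Add(key);
                }

                if (found.Count == keys.Count)
                    break;   // every string has been located - no need to read further files
            }

            foreach (string key in keys)
                Console.WriteLine("{0}\t{1}", key, found.Contains(key) ? "found" : "not found");
        }
    }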

2 Comments

Is if (text.IndexOf(key) >= 0) faster than if (text.Contains(key))?
@Ichi: No, I would expect them to be equally fast; Contains is just a wrapper around IndexOf. (Passing StringComparison.Ordinal to IndexOf does make a difference, though; the culture-aware default is slower.)
2

You might want to look at the Windows Search SDK here

http://msdn.microsoft.com/en-us/library/aa965362%28VS.85%29.aspx


2

Does the search have to run in real time against the current 500 MB of text? The reason I ask is that you could build a search index over the text files and run the lookups against it; it would be much faster. Take a look at Lucene:

Lucene.Net

C# and Lucene to index and search
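
For a rough idea of what that looks like, here is a sketch assuming Lucene.Net 3.0 (the paths and field names are made up). One caveat: a token-based index matches whole terms and phrases as the analyzer tokenizes them, not arbitrary substrings, so results can differ from a raw Contains scan over source code:

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    class LuceneSketch
    {
        static void Main()
        {
            var dir = FSDirectory.Open(new DirectoryInfo("index"));
            var analyzer = new StandardAnalyzer(Version.LUCENE_30);

            // 1. Index every file once (the expensive part, done up front)
            using (var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                foreach (var file in Directory.EnumerateFiles(@"C:\sources", "*", SearchOption.AllDirectories))
                {
                    var doc = new Document();
                    doc.Add(new Field("path", file, Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("content", File.ReadAllText(file), Field.Store.NO, Field.Index.ANALYZED));
                    writer.AddDocument(doc);
                }
            }

            // 2. Each lookup is then a fast index query, not a 500 MB scan
            var searcher = new IndexSearcher(dir, true);
            var parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
            foreach (var s in File.ReadAllLines("strings.txt"))
            {
                // quote the string so multi-word entries become phrase queries
                var query = parser.Parse("\"" + QueryParser.Escape(s) + "\"");
                var hits = searcher.Search(query, 1);
                Console.WriteLine("{0}\t{1}", s, hits.TotalHits > 0 ? "found" : "not found");
            }
        }
    }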

2 Comments

It doesn't need to be a real-time search. It's a one-time task: do it and forget about it :P
Then use Lucene (I have not used the Windows Search SDK) to build a full search index and perform lookups against it... I've used Lucene before... It's fast!
2
  1. In both a) and b), the second option is the more efficient one.
  2. Threading may not improve performance, because each thread would be reading files from your disk, so the disk will become the bottleneck.
  3. Sorry, I have no idea about software for this purpose.

Thread snippet:

      foreach (FileInfo file in FileList)
      {
          Thread t = new Thread(new ParameterizedThreadStart(ProcessFileData));
          t.Start(file.FullName);
      } // where ProcessFileData is the method that processes one file
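
Since this is .NET 4 / VS2010, an alternative sketch is Parallel.ForEach, which manages the threads for you; capping the degree of parallelism stops hundreds of threads from fighting over one disk (FileList and ProcessFileData are reused from the snippet above):

      using System.Threading.Tasks;

      // Limit concurrency: the disk, not the CPU, is the likely bottleneck here.
      var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
      Parallel.ForEach(FileList, options, file =>
      {
          ProcessFileData(file.FullName);   // same worker method as in the Thread version
      });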

General I/O Guidelines

What follows are some basic recommendations for reducing the I/O activity of your program, and thus enhancing its performance. As with all recommendations, it is important to measure the performance of the code being optimized before and after optimization to ensure that it actually gets faster.

  1. Minimize the number of file operations you perform.
  2. Group several small I/O transfers into one large transfer. A single write of eight pages is faster than eight separate single-page writes, primarily because it allows the hard disk to write the data in one pass over the disk surface. (A buffered-read sketch follows this list.)
  3. Perform sequential reads instead of seeking and reading small blocks of data. The kernel transparently clusters I/O operations, which makes sequential reads much faster.
  4. Avoid skipping ahead in an empty file before writing data. The system must write zeroes into the intervening space to fill the gap. Reading is typically cheaper than writing data.
  5. Defer any I/O operations until the point where your application actually needs the data.
  6. Use the preferences system to capture only user preferences (such as window positions and view settings), not data that can be inexpensively recomputed.
  7. Do not assume that caching file data in memory will speed up your application. Storing file data in memory improves speed until that memory gets swapped out to disk, at which point you pay the price of disk access once again. Strive to find an appropriate balance between reading from disk and caching in memory.
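
As a small illustration of points 2 and 3, reading a file sequentially through a StreamReader with a generously sized buffer turns many small transfers into a few large ones (a sketch; the 64 KB buffer size is an arbitrary starting point, and path is a placeholder):

    using System.IO;
    using System.Text;

    static void ScanFile(string path)
    {
        // A bigger buffer means fewer, larger disk reads (point 2), and reading
        // straight through keeps the access pattern sequential (point 3).
        using (var reader = new StreamReader(path, Encoding.UTF8, true, 64 * 1024))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // ... test the line against the search strings here
            }
        }
    }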

