14

Scenario - 150MB text file which is the exported Inbox of an old email account. Need to parse through and pull out emails from a specific user and writes these to a new, single file. I have code that works, its just dogged slow.

I'm using marker strings to search for where to begin/end the copy from the original file.

Here's the main function:

 StreamReader sr = new StreamReader("c:\\Thunderbird_Inbox.txt");
        string working = string.Empty;
        string mystring = string.Empty;
        while (!sr.EndOfStream)
        {
            while ((mystring = sr.ReadLine()) != null)
            {
                if (mystring == strBeginMarker)
                {
                    writeLog(mystring);

                    //read the next line
                    working = sr.ReadLine();

                        while( !(working.StartsWith(strEndMarker)))
                        {
                            writeLog(working);
                            working = sr.ReadLine();

                        }
                  }
            }

        }
        this.Text = "DONE!!";
        sr.Close();

The function that writes the selected messages to the new file:

  public void writeLog(string sMessage)
    {
            fw = new System.IO.StreamWriter(path, true);
            fw.WriteLine(sMessage);
            fw.Flush();
            fw.Close();
    }

Again, this process works. I get a good output file, it just takes a long time and I'm sure there are ways to make this faster.

2
  • BTW - You might want to consider the using statement instead of Close() manually - it's safer in case you hit an exception. My example demonstrates... Commented Jan 20, 2011 at 19:19
  • 1
    while (!sr.EndOfStream) is redundant with while ((mystring = sr.ReadLine()) != null) Commented Jan 20, 2011 at 19:20

5 Answers 5

20

The largest optimization would be to change your writeLog method to open the file once at the beginning of this operation, write to it many times, then close it at the end.

Right now, you're opening and closing the file each iteration where you write, which is going to definitely slow things down.

Try the following:

// Open this once at the beginning!
using(fw = new System.IO.StreamWriter(path, true))
{
    using(StreamReader sr = new StreamReader("c:\\Thunderbird_Inbox.txt"))
    {
        string working;
        string mystring;
        while ((mystring = sr.ReadLine()) != null)
        {
           if (mystring == strBeginMarker)
           {
                writeLog(mystring);

                //read the next line
                working = sr.ReadLine();

                while( !(working.StartsWith(strEndMarker)))
                {
                    fw.WriteLine(working);
                    working = sr.ReadLine();
                }
            }
        }
    }
}
this.Text = "DONE!!";
Sign up to request clarification or add additional context in comments.

2 Comments

THIS CHANGES EVERYTHING! :-) Fantastic response. What was taking 7 or 8 minutes finished in about 2 seconds. Better still, I learned some valuable coding techniques.
@paparush: Glad we could help ;)
2

I think you should:

  1. Open files once.
  2. Load source file in memory.
  3. Break it and use several threads for processing.

8 Comments

While I like this answer, in theory - in practice, it will probably not help much. The OP is still likely to be completely IO bound on the output (since it's being written to a single output file), so multithreading will likely yield very little benefit. I don't know if it'd be worth the complexity.
I agree with Reed here. How would you break it into threads? You'd still need to have logic to do so. If you broke it into chunks of equal size, you'd need to handle the case where a single message starts in one chunk and ends in another. If you did logic to break it at a marker boundary, you are pre-parsing before adding the overhead of threading, probably making it less efficient. That's not to say it couldn't be done - it's just much more work than it seems worth.
@Wonko the Sane, I think the simplest way is first - break it for equal parts and find begining of the first message in each, second - start thread proccessing from first message of each section.
@acoolaum - but, you aren't handling the case where the break is mid-message (the start marker is in chunk #1, the end marker is in chunk #2). Plus, there is no "processing" needed in this case - the task is, essentially, to take a number of substrings out of a string and plop them in another string.
@Wonko the Sane, the task is clear for me. In first step you have 2 parts: placing pointers of parts to calculated initial position then move pointer of each part to first message in this part.
|
2

I would just do a simple parser. Note that this assumes (as you do in your code above) that the markers are in fact unique.

You may have to play with the formatting a bit of your output, but here is the general idea:

   // Read the entire file and close it
   using (StreamReader sr = new
   StreamReader("c:\\Thunderbird_Inbox.txt");)
   {
       string data = sr.ReadToEnd();   
   }

   string newData = "";   
   int position = data.IndexOf(strBeginMarker);

   while (position > 0)   
   {
      int endPosition = data.IndexOf(endMarker, position);
      int markerLength = position + strBeginMarker.Length;

     newData += data.Substring(markerLength, endPosition - markerLength);

     position = data.IndexOf(strBeginMarker, position+ endStr.Length);   
   }

  writeLog(newData);

(Note that I don't have a 150 MB file to test this on - YMMV depending on the machine you are using).

2 Comments

You should at least warn OP about the potential dangers of loading a 150MB file at once through sr.ReadToEnd()...
Just did (in a very basic way)
0

I do not have a 150MB text file to test, but if your server has the memory would Reading the hold thing into a string and doing a RegEx pulling out the message work?

1 Comment

RegEx would work, but they can be fairly complex and difficult to get "just right." The parser solution I gave is essentially the same idea, but simpler for an less experienced coder.
0

You could simply declare the StreamWriter object outside of that while loop and just write the line to it inside the loop.

Like this:

StreamWriter sw = new StreamWriter(path, true);
while
{
    // ...
    while( !(working.StartsWith(strEndMarker)))
    {
        sw.WriteLine(working);
        working = sr.ReadLine();
    }
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.