2

I am running a c++ program in VS. I provided a regex and I am parsing a file which is over 2 million lines long for strings that match that regex. Here is the code:

int main() {
    ifstream myfile("file.log");
    if (myfile.is_open())
    {
        int order_count = 0;
        regex pat(R"(.*(SOME)(\s)*(TEXT).*)");
        for (string line; getline(myfile, line);)
        {
            smatch matches;
            if (regex_search(line, matches, pat)) {
                order_count++;
            }
        }
        myfile.close();
        cout << order_count;
    }

    return 0;
}

The file should search for the matched strings and count their occurrences. I have a python version of the program that does this within 4 seconds using the same regex. I have been waiting around 5 minutes for the above c++ code to work and it still hasn't finished. It is not running into an infinite loop because I had it print out its current line number at certain intervals and it is progressing.Is there a different way I should write the above code?

EDIT: This is run in release mode.

EDIT: Here is the python code:

class PythonLogParser:

def __init__(self, filename):
    self.filename = filename

def open_file(self):
    f = open(self.filename)
    return f

def count_stuff(self):
    f = self.open_file()
    order_pattern = re.compile(r'(.*(SOME)(\s)*(TEXT).*)')
    order_count = 0
    for line in f:
        if order_pattern.match(line) != None:
            order_count+=1 # = order_count + 1
    print 'Number of Orders (\'ORDER\'): {0}\n'.format(order_count)
    f.close()

The program finally stopped running. What's most disconcerting is that the output is incorrect (I know what the correct value should be).

Perhaps using regex for this problem is not the best solution. I will update if I find a solution that works better.

EDIT: Based on the answer by @ecatmur, I made the following changes, and the c++ program ran much faster.

int main() {
        ifstream myfile("file.log");
        if (myfile.is_open())
        {
            int order_count = 0;
            regex pat(R"(.*(SOME)(\s)*(TEXT).*)");
            for (string line; getline(myfile, line);)
            {
                if (regex_match(line, pat)) {
                    order_count++;
                }
            }
            myfile.close();
            cout << order_count;
        }

        return 0;
    }
15
  • 3
    Do you run it in Debug mode? Commented Jul 16, 2015 at 13:56
  • 2
    Add your python version as well. please. Do you compare on the same machine? Is antivirus software turned off? Commented Jul 16, 2015 at 13:58
  • 1
    make sure you read answers here: stackoverflow.com/questions/14205096/… Commented Jul 16, 2015 at 14:00
  • 2
    I guess your python version is performing one read (in for line in f it will read the whole file) while your C++ program is doing ~ 2M reads (one for every line). Reading/writing from/to the disk is generally slow, so I guess it's better to read the whole file content to once and then parse the corresponding string line by line. Commented Jul 16, 2015 at 14:05
  • 2
    You can test the theory that the C++ program is slowed down by reading one-line-at-a-time: replace the regex_search with a line that just increments order_count, and see if the program still takes a long time to run. Commented Jul 16, 2015 at 14:21

1 Answer 1

8

You should be using regex_match, not regex_search.

7.2.5.3. search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string

And:

std::regex_search

regex_search will successfully match any subsequence of the given sequence, whereas std::regex_match will only return true if the regular expression matches the entire sequence.

By using regex_search you are generating n * m match results, where n is the number of characters before and m is the number of characters after the central part of your search string. Not surprisingly, this takes a long time to generate.

In fact, the more efficient would be to use regex_search, but only with the central part of your search string:

    regex pat(R"((SOME)(\s)*(TEXT))");

And use the overload of regex_search that doesn't take a match results out-parameter (as you're ignoring the match results):

        if (regex_search(line, pat)) {    // no "matches"
            order_count++;
        }
Sign up to request clarification or add additional context in comments.

5 Comments

The C++ regex_constants::match_continuous may also help mimic the python match - my python here is a little hazy.
Worked much more quickly than the regex_search after making the changes, however, the output is still incorrect even though I'm using the same regex in both c++ and python. is there a difference between the way c++ interprets regex and the way python interprets regex?
@JeremyFisher oh, plenty; the default C++ regex syntax is Modified ECMAScript, which is pretty close to the Javascript syntax: en.cppreference.com/w/cpp/regex/ecmascript ; meanwhile Python has its own syntax: docs.python.org/2/library/re.html#regular-expression-syntax - you'll have to look into exactly where they disagree on your pattern.
@JeremyFisher: change \s to \\s. To improve your patterns in general, avoid to create useless capture groups. (\s)* => \s*
@CasimiretHippolyte he's using a raw literal, so I don't think double-escaping \s is necessary.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.