I am running a C++ program in VS. I supply a regex and parse a file of over 2 million lines, looking for lines that match that regex. Here is the code:
#include <fstream>
#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
    ifstream myfile("file.log");
    if (myfile.is_open())
    {
        int order_count = 0;
        regex pat(R"(.*(SOME)(\s)*(TEXT).*)");
        for (string line; getline(myfile, line);)
        {
            smatch matches;
            if (regex_search(line, matches, pat)) {
                order_count++;
            }
        }
        myfile.close();
        cout << order_count;
    }
    return 0;
}
The program should search the file for matching lines and count them. I have a Python version of the program that does this within 4 seconds using the same regex. I have been waiting around 5 minutes for the above C++ code to finish and it still hasn't. It is not stuck in an infinite loop: I had it print its current line number at intervals, and it is progressing. Is there a different way I should write the above code?
EDIT: This is run in release mode.
EDIT: Here is the Python code:
import re

class PythonLogParser:
    def __init__(self, filename):
        self.filename = filename

    def open_file(self):
        f = open(self.filename)
        return f

    def count_stuff(self):
        f = self.open_file()
        order_pattern = re.compile(r'(.*(SOME)(\s)*(TEXT).*)')
        order_count = 0
        for line in f:
            if order_pattern.match(line) is not None:
                order_count += 1
        print('Number of Orders (\'ORDER\'): {0}\n'.format(order_count))
        f.close()
The program finally stopped running. What's most disconcerting is that the output is incorrect (I know what the correct value should be).
Perhaps using regex for this problem is not the best solution. I will update if I find a solution that works better.
EDIT: Based on the answer by @ecatmur, I made the following change (regex_match instead of regex_search), and the C++ program ran much faster.
int main() {
    ifstream myfile("file.log");
    if (myfile.is_open())
    {
        int order_count = 0;
        regex pat(R"(.*(SOME)(\s)*(TEXT).*)");
        for (string line; getline(myfile, line);)
        {
            if (regex_match(line, pat)) {
                order_count++;
            }
        }
        myfile.close();
        cout << order_count;
    }
    return 0;
}
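A further variation, offered only as a sketch: since regex_search already looks for a match anywhere in the line, the leading and trailing .* and the capture groups can probably be dropped, leaving SOME\s*TEXT. For ordinary single-line input this should accept the same lines as regex_match with the full pattern, though that is an assumption rather than something verified on the real log. The function names are hypothetical, and whether std::regex::optimize helps depends on the standard library implementation.

```cpp
#include <cstddef>
#include <istream>
#include <regex>
#include <string>

// Hypothetical helper: builds the simplified pattern once and reuses it.
// No ".*" wrappers and no capture groups, because regex_search scans the
// whole line by itself and only a yes/no answer is needed.
const std::regex& some_text_pattern() {
    static const std::regex pat(
        R"(SOME\s*TEXT)",
        std::regex::ECMAScript | std::regex::optimize);
    return pat;
}

// Hypothetical helper: counts the lines of `in` in which `pat` matches
// somewhere. No smatch is passed, so no sub-matches are recorded.
std::size_t count_lines_matching(std::istream& in, const std::regex& pat) {
    std::size_t count = 0;
    for (std::string line; std::getline(in, line);) {
        if (std::regex_search(line, pat)) {
            ++count;
        }
    }
    return count;
}
```

With the open stream from the code above, the call would be count_lines_matching(myfile, some_text_pattern()).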
[...] for line in f it will read the whole file) while your C++ program is doing ~2M reads (one for every line). Reading from or writing to the disk is generally slow, so I guess it's better to read the whole file content at once and then parse the corresponding string line by line.
[...] regex_search with a line that just increments order_count, and see if the program still takes a long time to run.