
I'm trying to execute the following C++ STL-based code to replace text in a relatively large SQL script (~8MB):

std::basic_regex<TCHAR> reProc(_T("^[ \t]*create[ \t]+(view|procedure|proc)+[ \t]+(.+)$\n((^(?![ \t]*go[ \t]*).*$\n)+)^[ \t]*go[ \t]*$"));
std::basic_string<TCHAR> replace = _T("ALTER $1 $2\n$3\ngo");
return std::regex_replace(strInput, reProc, replace);

The result is a stack overflow, and it's hard to find information about that particular error on this particular site since that's also the name of the site.

Edit: I am using Visual Studio 2013 Update 5

Edit 2: The original file is over 23,000 lines. I cut the file down to 3,500 lines and still get the error. When I cut it by another ~50 lines down to 3,456 lines, the error goes away. If I put just those cut lines into the file, the error is still gone. This suggests that the error is not related to specific text, but just too much of it.

Edit 3: The full expression works as intended on regex101: https://regex101.com/r/iD1zY6/1 The same expression fails in the STL code above, though.

  • Do you know the strInput that triggers the stack overflow? Commented May 18, 2016 at 16:14
  • @vu1p3n0x Yes, but I'm not sure how to share such a large input string. I don't want to put 8 MB of text in the question. Commented May 18, 2016 at 16:15
  • Your regex is bounded by line ("^...$"). Is the file all one line? Is there a single line that triggers it? Or does it only happen when processing the whole file at once? Commented May 18, 2016 at 16:21
  • IMO you should replace this complicated regex with a loop over the lines: std::find_if for "create (view|proc)", std::find_if for "go", grab everything in between and do your replacement this way. Commented May 18, 2016 at 17:06
  • Is it just me, or do you only want to change create to alter for each procedure/view? Commented May 18, 2016 at 17:36

2 Answers


The following trimmed-down version of your regex saves about 20% of processing steps according to regex101 (see here).

\\bcreate[ \t]+(view|procedure|proc)[ \t]+(.+)\n(((?![ \t]*go[ \t]*).*\n)+)[ \t]*go[ \t]*

Modifications:

  • inline anchors removed: you are expressly testing for newline characters
  • repetition operator for the db object keywords removed - a repetition at this point would make the original script syntactically invalid.
  • initial whitespace pattern replaced by word boundary (note the double backslash - the escape sequence is for the regex engine, not for the compiler)
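
The note about the double backslash can be illustrated with a small sketch. It assumes plain char strings and std::regex instead of TCHAR, case-insensitive matching, and two hypothetical helper names (has_create_word, has_create_after_backspace) that are not part of the original code:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` contains "create" starting at a word boundary.
// "\\b" in the source becomes the two characters \ and b, which the regex
// engine interprets as a word-boundary assertion.
bool has_create_word(const std::string& text) {
    static const std::regex re("\\bcreate", std::regex::icase);
    return std::regex_search(text, re);
}

// Same pattern written with a single backslash: the compiler converts "\b"
// into a backspace character (0x08) before the regex engine sees it, so the
// pattern only matches text that literally contains backspace + "create".
bool has_create_after_backspace(const std::string& text) {
    static const std::regex re("\bcreate", std::regex::icase);
    return std::regex_search(text, re);
}
```

A raw string literal such as R"(\bcreate)" is another way to pass the backslash through to the regex engine unchanged.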

If you can be sure that ...

  • the create ... statements do not occur in string literals, and

  • you do not need to distinguish between create ... statements that are followed by a go and those that are not (e.g. because all statements are followed by a go)

...it might even be easier to just replace these strings:

std::basic_regex<TCHAR> reProc(_T("\\bcreate[ \t]+(view|procedure|proc)"));
std::basic_string<TCHAR> replace = _T("ALTER $1");
return std::regex_replace(strInput, reProc, replace);

(Here is a demo for the latter approach; it reduces the number of steps to a little more than a quarter.)
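
For reference, here is a self-contained sketch of the keyword-only replacement. It assumes narrow std::string input rather than TCHAR, and it assumes case-insensitive matching is wanted (the snippet above sets no flags); create_to_alter is a hypothetical name:

```cpp
#include <regex>
#include <string>

// Sketch: rewrite "create view|procedure|proc" to "ALTER <keyword>" everywhere
// in the script. Whitespace between the two words is collapsed to one space.
std::string create_to_alter(const std::string& script) {
    // "\\b" reaches the regex engine as \b (word boundary).
    // icase is an assumption, chosen because SQL keywords are case-insensitive.
    static const std::regex re("\\bcreate[ \t]+(view|procedure|proc)",
                               std::regex::icase | std::regex::optimize);
    return std::regex_replace(script, re, "ALTER $1");
}
```

Calling create_to_alter on the whole script leaves every non-matching character untouched, so the go separators and statement bodies pass through unchanged.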


4 Comments

The "\b" at the beginning seems to be preventing the STL regex from matching a create at the beginning of a line for some reason. Need a double \\ I assume.
It seems this solution is still going to be far too slow. If I were writing C# code I would split it up by "\ngo\n" and replace in each component. But I don't know how to do that in STL. It's been running more than a minute and still not done, and I think VB6 was able to do this in less than a minute by processing it one line at a time (I'm rewriting some old code). I thought I could simplify the code by processing it all at once, but the cost turns out to be too high. I don't even know how to split up text into lines with STL.
So maybe splitting as suggested in this SO answer would help? The text portions to be replaced do not appear to span lines.
The demo from the abovementioned answer adjusted to your sample input is online here

It turns out that STL regular expressions are tragic under-performers versus Perl (about 100 times slower, if you can believe https://stackoverflow.com/a/37016671/78162), so it is apparently necessary to minimize the use of regular expressions in STL/C++ when performance is a serious concern. (The degree to which C++/STL under-performs here blew my mind, considering that I presume C++ to be one of the more performant languages in general.) I ended up reading the file stream one line at a time and running the expression only on lines that need processing, like this:

   std::basic_string<TCHAR> result;
   std::basic_string<TCHAR> line;
   std::basic_regex<TCHAR> reProc(_T("^[ \t]*create[ \t]+(view|procedure|proc)+[ \t]+(.+)$"), std::regex::optimize);
   std::basic_string<TCHAR> replace = _T("ALTER $1 $2");

   // Loop on the result of getline so that the failed read at end of file
   // is not appended as an extra empty line.
   while (std::getline(input, line)) {
      // Use size_type rather than int so the comparison with npos is exact.
      std::basic_string<TCHAR>::size_type pos = line.find_first_not_of(_T(" \t"));
      // Cheap case-insensitive prefix test; only candidate lines reach the regex.
      if ((pos != std::basic_string<TCHAR>::npos)
          && (_tcsnicmp(line.c_str() + pos, _T("create"), 6)==0))
         result.append(std::regex_replace(line, reProc, replace));
      else
         result.append(line);
      result.append(_T("\n"));
   }
   return result;
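
The same idea can be sketched without the Windows-specific TCHAR and _tcsnicmp. This version is an illustration under stated assumptions, not the code I actually ran: it uses narrow strings, adds std::regex::icase so the regex agrees with the case-insensitive prefix test, drops the repetition after the keyword group, and the names starts_with_create, alter_creates and alter_creates_text are hypothetical:

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <regex>
#include <sstream>
#include <string>

// Case-insensitive check that `line` has "create" starting at `pos`.
static bool starts_with_create(const std::string& line, std::size_t pos) {
    static const char kw[] = "create";
    if (line.size() - pos < 6) return false;
    for (std::size_t i = 0; i < 6; ++i) {
        if (std::tolower(static_cast<unsigned char>(line[pos + i])) != kw[i])
            return false;
    }
    return true;
}

// Reads `input` line by line and runs the regex only on candidate lines.
// Every output line ends with '\n', including the last one.
std::string alter_creates(std::istream& input) {
    static const std::regex re(
        "^[ \t]*create[ \t]+(view|procedure|proc)[ \t]+(.+)$",
        std::regex::icase | std::regex::optimize);
    std::string result, line;
    while (std::getline(input, line)) {
        const std::size_t pos = line.find_first_not_of(" \t");
        if (pos != std::string::npos && starts_with_create(line, pos))
            result += std::regex_replace(line, re, "ALTER $1 $2");
        else
            result += line;
        result += '\n';
    }
    return result;
}

// Convenience wrapper for in-memory text.
std::string alter_creates_text(const std::string& script) {
    std::istringstream input(script);
    return alter_creates(input);
}
```

Passing a std::ifstream to alter_creates works the same way as the string stream used by the wrapper. Lines that start with create but name some other object (for example create table) pass the prefix test, fail the regex, and are copied through unchanged.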
