
The following regex removes lists and menus from webpage text:

String s = fileContent.replaceAll("(([A-Za-z&—:\\-\\/\\d ])*(\\n|\\r|\\r\\n)){5,}","");

It has worked without problems on tens of thousands of files. Today, it threw a StackOverflowError:

java.lang.StackOverflowError
    at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
    at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
    at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
    at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3776)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4435)
    at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4405)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4785)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4717)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4717)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4568)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3798)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4485)
    at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4405)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4785)

The file it was parsing has hundreds of consecutive \r\n sequences. Other than that, I can't see anything unusual. Can someone advise which aspect of the expression and/or Java's internal regex matching caused the error?

  • 3
    I believe this answers it: stackoverflow.com/questions/7509905/… Commented Dec 13, 2017 at 23:15
  • 1
    That's it! Funny it didn't come up when I searched. Thanks a million Commented Dec 13, 2017 at 23:18
  • 1
    You're welcome. Very interesting case. Commented Dec 13, 2017 at 23:18
  • 1
    What happens if you replace (\\n|\\r|\\r\\n) with (\\n|\\r\\n?) to reduce the number of alternative paths? Commented Dec 13, 2017 at 23:36
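
The stack trace shows Loop.match and GroupTail.match calling back into each other, which suggests that each repetition of the outer {5,} group adds frames to the call stack, so a long enough run of matching lines can exhaust it. One way to sidestep this, sketched below under some assumptions, is to scan the text line by line and use the regex only to test a single line. The class name MenuStripper, the method stripMenus and the constant MIN_RUN are hypothetical names chosen for illustration. The sketch is not a drop-in equivalent of the original expression: it treats \r\n as one line break (as the (\\n|\\r\\n?) suggestion above would, whereas the original alternation tries \r on its own first), and it only removes whole lines, while the original can begin matching in the middle of a line.

```java
import java.util.regex.Pattern;

public class MenuStripper {
    // Same character class as the original expression, applied to one line at a time.
    // \u2014 is the dash character that appears in the original class.
    private static final Pattern MENU_LINE =
            Pattern.compile("[A-Za-z&\u2014:\\-/\\d ]*+");

    // Minimum number of consecutive menu-like lines to remove (the {5,} of the original).
    private static final int MIN_RUN = 5;

    public static String stripMenus(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int n = text.length();
        int pos = 0;
        int runStart = -1;  // where the current run of menu-like lines begins
        int runLength = 0;  // number of lines in the current run

        while (pos < n) {
            // Find the end of the line content.
            int eol = pos;
            while (eol < n && text.charAt(eol) != '\n' && text.charAt(eol) != '\r') {
                eol++;
            }
            // Consume one line terminator: \r\n, \r or \n.
            int next = eol;
            if (next < n) {
                boolean crlf = text.charAt(next) == '\r'
                        && next + 1 < n && text.charAt(next + 1) == '\n';
                next += crlf ? 2 : 1;
            }
            // Like the original regex, only lines that end in a terminator qualify.
            boolean terminated = next > eol;
            boolean menuLike = terminated
                    && MENU_LINE.matcher(text.subSequence(pos, eol)).matches();

            if (menuLike) {
                if (runLength == 0) {
                    runStart = pos;
                }
                runLength++;
            } else {
                // A run shorter than MIN_RUN is kept; a longer one is dropped.
                if (runLength > 0 && runLength < MIN_RUN) {
                    out.append(text, runStart, pos);
                }
                runLength = 0;
                out.append(text, pos, next);
            }
            pos = next;
        }
        // Flush a short run left over at the end of the input.
        if (runLength > 0 && runLength < MIN_RUN) {
            out.append(text, runStart, n);
        }
        return out.toString();
    }
}
```

With this sketch, String s = MenuStripper.stripMenus(fileContent); would stand in for the replaceAll call. Because the loop is iterative and the per-line pattern contains no nested group, stack use should not grow with the number of consecutive line breaks in the file.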
