8

In Scala's parser combinators (JavaTokensParser in particular) there is a definition stringLiteral that matches a Java-like string.

def stringLiteral: Parser[String] =
              ("\""+"""([^"\p{Cntrl}\\]|\\[\\'"bfnrt]|\\u[a-fA-F0-9]{4})*"""+"\"").r

Unfortunately, this regex does not work for long strings. Does anyone know of a re-usable implementation that does, or a modification to the regex that is more space efficient?

1 Answer 1

3

Interesting problem!!

Just played around with this, and came up with the following:

val r = ("\"" + "(?:[^\"\\p{Cntrl}\\\\]*|(?:\\\\(?:[\\\\'\"bfnrt]|u[a-fA-F0-9]{4}))*)*" + "\"").r

Note: the above regex was fixed from the first version… both the leading '\' and the trailing characters need to be repeated, not just the trailing characters as I originally had it!

Edit: Found a more efficient regular expression. Using the following, it can parse a string of up to 950 \\ns pairs, as opposed to the original which could only parse 556, at least in my default configuration.

val r = ("\"" + "(?:[^\"\\p{Cntrl}\\\\]*|\\\\[\\\\'\"bfnrt]|\\\\u[a-fA-F0-9]{4})*" + "\"").r

Edit 2: Based on the comment from @schmmd, I have an even better regular expression. This one can parse the 2500 \ns torture case. The secret is to use the greedy possessive modifier, this basically turns off the need to backtrack, and hence also turns off the recursion.

val r = (""""([^"\p{Cntrl}\\]*+(?:\\[\\'"bfnrt])*+(?:\\u[a-fA-F0-9]{4})*+)*+"""").r

The essence of the solution is to try and chew as much as you can each time that you match something.

scala> val r = (""""([^"\p{Cntrl}\\]*+(?:\\[\\'"bfnrt])*+(?:\\u[a-fA-F0-9]{4})*+)*+"""").r
r: scala.util.matching.Regex = "([^"\p{Cntrl}\\]*+(?:\\[\\'"bfnrt])*+(?:\\u[a-fA-F0-9]{4})*+)*+"

scala> r.pattern.matcher("\"" + "\\ns" * 2500 + "\"").lookingAt
res4: Boolean = true

scala> r.pattern.matcher("\"" + "s" * 2500 + "\"").lookingAt
res5: Boolean = true

Update: A pull request was submitted to the scala folks. And it was accepted.

Sign up to request clarification or add additional context in comments.

5 Comments

Your regex is fine, but it makes sense to me to remove all the disjunctions. It still overflows on your "\\ns" * 2500 though. ("\"" + """([^"\p{Cntrl}\\]*(?:\\[\\'"bfnrt])*(?:\\u[a-fA-F0-9]{4})*)*""" + "\"").r.
Switching all my Kleene stars to possessive Kleene stars provides better performance. ("\"" + """([^"\p{Cntrl}\]*+(?:\[\\'"bfnrt])*+(?:\\u[a-fA-F0-9]{4})*+)*+""" + "\"").r
@schmmd You are right! The possessive operator is indeed the answer. You could even use that in the original answer, and it solves the issue as well.
I'll accept your answer for now. I would love to hear of a library that parsers for this pattern in it, along with methods to convert the parsed Java string into a String instance.
I do hope this gets submitted to the Scala library.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.