I wish to generate a regular expression from a string containing numbers, and then use this as a Pattern to search for similar strings. Example:

String s = "Page 3 of 23";

If I substitute every digit with \d:

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isDigit(c)) {
            sb.append("\\d"); // backslash d
        } else {
            sb.append(c);
        }
    }

    Pattern numberPattern = Pattern.compile(sb.toString());

    // equivalent to: Pattern.compile("Page \\d of \\d\\d");

I can use this to match similar strings (e.g. "Page 7 of 47"). My problem is that if I do this naively, metacharacters such as ( ) { } - will not be escaped. Is there a library to do this, or an exhaustive list of the characters that I must and must not escape in a regular expression? (I can try to extract them from the Javadocs, but I am worried about missing something.)

Alternatively, is there a library that already does this? (I don't want to use a full Natural Language Processing solution at this stage.)

NOTE: @dasblinkenlight's edited answer now works for me!

  • Here's an answer to the "which characters" question; I'm not aware of any libraries that generate regexes, though: stackoverflow.com/questions/399078/… Commented Apr 16, 2013 at 10:16
  • @Evan thanks. I am only interested in Java so that looks like a useful resource. Commented Apr 16, 2013 at 10:18

1 Answer

Java's regexp library provides this functionality:

String s = Pattern.quote(orig);

The "quoted" string will have all its metacharacters treated as literals. First quote your string, then go through it and replace the digits with \d to make a regular expression. Because Pattern.quote wraps the string in \Q and \E, each piece of regex you splice in must be enclosed in the inverse pair, \E before it and \Q after it, so that it sits outside the quoted section.
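As a rough illustration of what quoting buys you, the sketch below compares a quoted and an unquoted pattern built from the same text. The class name QuoteDemo, the helper matchesLiterally, and the sample string are made up for this example; only Pattern.quote and Pattern.matches come from the JDK.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original question or answer.
class QuoteDemo {

    // Returns true if the quoted form of 'text' matches 'text' itself.
    static boolean matchesLiterally(String text) {
        return Pattern.matches(Pattern.quote(text), text);
    }

    public static void main(String[] args) {
        String orig = "Cost (a+b)"; // sample input containing metacharacters

        // Pattern.quote wraps the input in \Q ... \E.
        System.out.println(Pattern.quote(orig));

        // Quoted: the parentheses and '+' are treated as literal characters.
        System.out.println(matchesLiterally(orig));

        // Unquoted: '(' ')' form a group and '+' is a quantifier,
        // so the pattern no longer matches its own source text.
        System.out.println(Pattern.matches(orig, orig));
    }
}
```

The first boolean printed should be true and the second false, which is the reason for quoting before adding any \d pieces.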

One thing I would change in your implementation is the replacement algorithm: rather than replacing character-by-character, I would replace digits in groups. This would let an expression produced from Page 3 of 23 match strings like Page 13 of 23 and Page 6 of 8.

String p = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");

This produces "\QPage \E\d+\Q of \E\d+\Q\E" regardless of which page numbers and counts were there originally. The output needs only one backslash in \d, not two, because the result is passed directly to the regex engine rather than through the Java compiler as a string literal.
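A minimal, self-contained sketch of the approach follows, assuming an example string with extra metacharacters added to show that they stay literal. The class name NumberTemplate and the helper fromExample are hypothetical; the body of the helper is the one-liner given above.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class for the one-liner from the answer.
class NumberTemplate {

    // Quote the example, then replace each run of digits with an
    // unquoted \d+ by closing (\E) and reopening (\Q) the quoted section.
    static Pattern fromExample(String example) {
        String regex = Pattern.quote(example)
                .replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        // "(draft)" is added here only to include metacharacters.
        Pattern p = fromExample("Page 3 of 23 (draft)");

        System.out.println(p.pattern());
        System.out.println(p.matcher("Page 13 of 230 (draft)").matches());
        System.out.println(p.matcher("Page 13 of 230 [draft]").matches());
        System.out.println(p.matcher("Page x of 23 (draft)").matches());
    }
}
```

Only the first of the three candidate strings should match: the second differs in the literal brackets and the third has a non-digit where a number is expected.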

11 Comments

Cool, I didn't know about this method.
@dasblinkenlight Great! Agreed I might look for repeated digits but there is also heuristics value for me in having exact digit counts. I may use both approaches.
@peter.murray.rust See the last edit: the number of slashes required to make two slashes in the output is really ridiculous: times two for the compiler and times two for the regex library, for a total of eight slashes.
@dasblinkenlight. Which is why I am grateful to you for creating it! Takes me a long time to get the number right! (I tend to use constants to help break it down)
I think this answer has everything I need and can be extended so I have accepted it