I'm trying to make search keywords bold in result titles by replacing each keyword with <b>kw</b> using the replaceAll() method. I also need to ignore any special characters in the keywords when highlighting. The code below is what I'm using, but on the second pass it also replaces text inside the <b> tags inserted by the first pass. I am looking for an elegant regex solution, since my alternative is becoming too big without covering all cases. For example, with this input:

addHighLight("a b", "abacus") 

...I get this result:

<<b>b</b>>a</<b>b</b>><b>b</b><<b>b</b>>a</<b>b</b>>cus

public static String addHighLight(String kw, String text) {
    String highlighted = text;
    if (kw != null && !kw.trim().isEmpty()) {
        List<String> tokens = Arrays.asList(kw.split("[^\\p{L}\\p{N}]+"));
        for(String token: tokens) {
            try {
                highlighted = highlighted.replaceAll("(?i)(" + token + ")", "<b>$1</b>");
            } catch ( Exception e) {
                e.printStackTrace();
            }
        }
    }
    return highlighted;
}
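
To see where the nested tags come from, the two passes can be traced separately. The sketch below runs essentially the same per-token loop and records the string after each pass; the class name HighlightTrace and the method passes are made up for illustration, and the empty-token check is an added assumption.

```java
import java.util.ArrayList;
import java.util.List;

class HighlightTrace {
    // Runs the per-token replaceAll loop from the question and records the
    // string after every pass, so the nesting can be inspected step by step.
    static List<String> passes(String kw, String text) {
        List<String> states = new ArrayList<>();
        String highlighted = text;
        for (String token : kw.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // assumption: skip the empty token split() yields for a leading separator
            }
            highlighted = highlighted.replaceAll("(?i)(" + token + ")", "<b>$1</b>");
            states.add(highlighted);
        }
        return states;
    }

    public static void main(String[] args) {
        // Pass 1 (token "a") inserts <b> tags; pass 2 (token "b") then matches
        // the letter b inside those freshly inserted tags.
        for (String state : passes("a b", "abacus")) {
            System.out.println(state);
        }
    }
}
```

After the first pass the string is <b>a</b>b<b>a</b>cus; the second pass matches every b in it, including the ones in the tags.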
3 Comments
  • FYI, Sets don't keep track of the order in which their contents are stored, so I switched to a List, which does. I also had to reverse your sample string (i.e. "a b") in order to reproduce your results. Commented Sep 27, 2013 at 6:35
  • I used a Set to be a little more efficient in case a word is repeated, e.g. for kw = a b b b a. Commented Sep 27, 2013 at 18:49
  • I guessed as much. And a Set should work fine because the order shouldn't matter. I only changed it to a List to demonstrate that it is order sensitive. Commented Sep 27, 2013 at 19:35

3 Answers

1
  1. Don't forget to use Pattern.quote(token) (unless kw is guaranteed to contain no regex metacharacters).
  2. If you're bound to use replaceAll() (instead of tokenizing the input into tag|text|tag|text|... and applying the replacement to the text parts only, which would be much simpler and faster), the code below should help.

Note that it's not efficient - it matches some empty or already-highlighted spots and thus requires "curing" after substitution, but should treat XML/HTML tags (except CDATA) properly.

Here's a "curing" function (no null checks):

private static Pattern cureDoubleB = Pattern.compile("<b><b>([^<>]*)</b></b>");
private static Pattern cureEmptyB = Pattern.compile("<b></b>");
private static String cure(String input) {
    return cureEmptyB.matcher(cureDoubleB.matcher(input).replaceAll("<b>$1</b>")).replaceAll("");
}

Here's what the replaceAll line should look like:

String txt = "[^<>" + Pattern.quote(token.substring(0, 1).toLowerCase()) + Pattern.quote(token.substring(0, 1).toUpperCase()) +"]*";
highlighted = cure(highlighted.replaceAll("((<[^>]*>)*"+txt+")(((?i)" + Pattern.quote(token) + ")|("+txt+"))", "$1<b>$4</b>$5"));
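
For comparison, the tokenizing alternative mentioned in point 2 could look roughly like the sketch below. It is only an illustration under some assumptions: tags are recognized with the simple pattern <[^>]*> (so CDATA, comments and attribute values containing > are not handled), all keywords are combined into one alternation so each text run is scanned once, and the class name TagAwareHighlighter is made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAwareHighlighter {
    // Assumption: anything of the form <...> is a tag; the rest is plain text.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String addHighLight(String kw, String text) {
        if (kw == null || text == null) {
            return text;
        }
        List<String> tokens = new ArrayList<>();
        for (String token : kw.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        if (tokens.isEmpty()) {
            return text;
        }
        // Longer keywords first, so "ab" is preferred over "a" in the alternation.
        tokens.sort(Comparator.comparingInt(String::length).reversed());
        List<String> quoted = new ArrayList<>();
        for (String token : tokens) {
            quoted.add(Pattern.quote(token));
        }
        // One pattern for all keywords: inserted <b> tags are never rescanned.
        Pattern keywords = Pattern.compile("(" + String.join("|", quoted) + ")",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

        StringBuilder out = new StringBuilder();
        Matcher tags = TAG.matcher(text);
        int last = 0;
        while (tags.find()) {
            // Highlight the text run before this tag, then copy the tag untouched.
            out.append(highlight(keywords, text.substring(last, tags.start())));
            out.append(tags.group());
            last = tags.end();
        }
        out.append(highlight(keywords, text.substring(last)));
        return out.toString();
    }

    private static String highlight(Pattern keywords, String plainText) {
        return keywords.matcher(plainText).replaceAll("<b>$1</b>");
    }
}
```

Because existing tags are copied through verbatim and the keywords are applied in a single pass, no "curing" step is needed in this variant.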

1 Comment

@Alan Moore - before editing it said "(except <pre>/CDATA)" ;)
1

Since you're already excluding special characters from your keywords, the simplest way around this might just be to add a bit more to your search pattern. The following should prevent you from matching text that's already part of an HTML tag:

highlighted = highlighted.replaceAll("(?i)[^<](" + token + ")", "<b>$1</b>");
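
One thing to watch with this pattern: [^<] consumes the character in front of the keyword, and since only $1 is written back, that character is dropped from the output (and a keyword at the very start of the text cannot match at all). The sketch below contrasts it with a zero-width negative lookbehind, which is an assumed variation rather than part of the answer; the class and method names are made up.

```java
class PrefixGuardDemo {
    // Pattern from the answer: "[^<]" matches and consumes one character
    // before the keyword, but the replacement only writes back group 1.
    static String consuming(String token, String text) {
        return text.replaceAll("(?i)[^<](" + token + ")", "<b>$1</b>");
    }

    // Assumed variation: (?<!<) checks the previous character without
    // consuming it, so nothing is lost and a match at index 0 is possible.
    static String lookbehind(String token, String text) {
        return text.replaceAll("(?i)(?<!<)(" + token + ")", "<b>$1</b>");
    }

    public static void main(String[] args) {
        System.out.println(consuming("a", "abacus"));  // a<b>a</b>cus (the first "b" is gone)
        System.out.println(lookbehind("a", "abacus")); // <b>a</b>b<b>a</b>cus
    }
}
```

Either way, this only guards the character directly after "<"; letters further inside a tag name, or after "/", can still match.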

1 Comment

This gave me the idea of using lookbehind, although it doesn't work completely.
1

This code worked for me with minimal changes, using regex lookbehind:

highlighted = highlighted.replaceAll("(?i)((?<!<)(?<!/)" + token + "(?<!>))", "<b>$1</b>");

1 Comment

I don't think you'll like what that'd do to "</table>abacus"
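
The concern in this comment can be checked directly. The small sketch below applies the lookbehind pattern from the answer to that input with the single keyword "a"; the class name LookbehindCheck is made up for illustration.

```java
class LookbehindCheck {
    // Same pattern as in the answer: the lookbehinds only inspect the single
    // character immediately before the keyword.
    static String highlight(String token, String text) {
        return text.replaceAll("(?i)((?<!<)(?<!/)" + token + "(?<!>))", "<b>$1</b>");
    }

    public static void main(String[] args) {
        // The "a" inside "table" is preceded by "t", so it is matched as well.
        System.out.println(highlight("a", "</table>abacus"));
        // prints </t<b>a</b>ble><b>a</b>b<b>a</b>cus
    }
}
```

The closing tag gets split because any keyword occurring more than one character into a tag name passes both lookbehinds. Note also that the trailing (?<!>) looks at the last character of the keyword itself, so it never rejects anything here.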
