Java regex replaceAll with exclude pattern

Question

I'm trying to make search keywords bold in result titles by replacing each keyword kw with <b>kw</b> using the replaceAll() method. I also need to ignore any special characters in the keywords when highlighting. This is the code I'm using, but on the second pass it also replaces inside the bold tags inserted by the first pass. I am looking for an elegant regex solution, since my alternative is becoming too big without covering all cases. For example, with this input:

addHighLight("a b", "abacus")

...I get this result:

<b>a</b>b<b>a</b>cus

import java.util.Arrays;
import java.util.List;

public static String addHighLight(String kw, String text) {
    String highlighted = text;
    if (kw != null && !kw.trim().isEmpty()) {
        // Split the keywords on anything that is not a letter or a digit.
        List<String> tokens = Arrays.asList(kw.split("[^\\p{L}\\p{N}]+"));
        for (String token : tokens) {
            try {
                // Each pass also scans the <b> tags inserted by earlier passes.
                highlighted = highlighted.replaceAll("(?i)(" + token + ")", "<b>$1</b>");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
    return highlighted;
}

FYI, Sets don't keep track of the order in which their contents are stored, so I switched to a List, which does. I also had to reverse your sample string (i.e. "a b") in order to reproduce your results.
– Alan Moore, Commented Sep 27, 2013 at 6:35
I used a Set to be a little more efficient in case a word is repeated, e.g. for kw = a b b b a.
– susmit shukla, Commented Sep 27, 2013 at 18:49
I guessed as much. And a Set should work fine because the order shouldn't matter. I only changed it to a List to demonstrate that it is order-sensitive.
– Alan Moore, Commented Sep 27, 2013 at 19:35

Vlad · Accepted Answer · 2013-09-27 05:04 (edited by Alan Moore, 2013-09-27 06:28:12Z)

1

Don't forget to use Pattern.quote(token), unless the keywords are guaranteed to contain no regex metacharacters.
If you're bound to use replaceAll() (instead of tokenizing the input into tag|text|tag|text|... and applying the replacement to the text segments only, which would be much simpler and faster), the code below should help.

Note that it's not efficient: it matches some empty or already-highlighted spots and thus requires "curing" after substitution, but it should treat XML/HTML tags (except CDATA) properly.

Here's a "curing" function (no null checks):

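// Collapses a doubled highlight, <b><b>text</b></b>, into a single one.
private static final Pattern cureDoubleB = Pattern.compile("<b><b>([^<>]*)</b></b>");
// Matches the empty highlights left behind by the substitution.
private static final Pattern cureEmptyB = Pattern.compile("<b></b>");

private static String cure(String input) {
    String singleB = cureDoubleB.matcher(input).replaceAll("<b>$1</b>");
    return cureEmptyB.matcher(singleB).replaceAll("");
}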

Here's what the replaceAll line should look like:

String txt = "[^<>" + Pattern.quote(token.substring(0, 1).toLowerCase()) + Pattern.quote(token.substring(0, 1).toUpperCase()) +"]*";
highlighted = cure(highlighted.replaceAll("((<[^>]*>)*"+txt+")(((?i)" + Pattern.quote(token) + ")|("+txt+"))", "$1<b>$4</b>$5"));
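The alternative mentioned at the top of this answer (tokenizing the input into tag|text|tag|text|... and replacing only inside the text segments) could look roughly like the sketch below. It is not the answerer's code. The class name TagAwareHighlighter and the TAG pattern are assumptions made for illustration, and the simple <[^>]*> tag pattern does not handle comments, CDATA, or a ">" inside an attribute value. The sketch also combines all keywords into one alternation, which is a further assumption, so that inserted <b> tags are never scanned again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
public class TagAwareHighlighter {
    // Simplistic tag matcher (an assumption): one "<...>" run with no ">" inside.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String addHighLight(String kw, String text) {
        if (kw == null || text == null) {
            return text;
        }
        // Same tokenization as the question: split on anything that is not a letter or digit.
        List<String> tokens = new ArrayList<>();
        for (String token : kw.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        if (tokens.isEmpty()) {
            return text;
        }
        // Longer keywords first, so "ab" is preferred over "a" where both could match.
        tokens.sort((a, b) -> Integer.compare(b.length(), a.length()));
        List<String> quoted = new ArrayList<>();
        for (String token : tokens) {
            quoted.add(Pattern.quote(token));
        }
        // One case-insensitive alternation covering every keyword.
        Pattern keywords = Pattern.compile("(?i)(" + String.join("|", quoted) + ")");

        StringBuilder out = new StringBuilder();
        Matcher tag = TAG.matcher(text);
        int pos = 0;
        while (tag.find()) {
            // Highlight the text segment before the tag, then copy the tag unchanged.
            out.append(keywords.matcher(text.substring(pos, tag.start())).replaceAll("<b>$1</b>"));
            out.append(tag.group());
            pos = tag.end();
        }
        // Highlight whatever text follows the last tag.
        out.append(keywords.matcher(text.substring(pos)).replaceAll("<b>$1</b>"));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(addHighLight("a b", "abacus"));         // <b>a</b><b>b</b><b>a</b>cus
        System.out.println(addHighLight("a b", "</table>abacus")); // </table><b>a</b><b>b</b><b>a</b>cus
    }
}
```

Because every keyword is matched in a single pass over each text segment, no "curing" step is needed in this variant.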

edited Sep 27, 2013 at 6:28

Alan Moore

answered Sep 27, 2013 at 5:04

Vlad

1 Comment

Vlad Over a year ago

@Alan Moore - before editing it said "(except <pre>/CDATA)" ;)

Josh · Accepted Answer · 2013-09-27 02:08:36Z

1

Since you're already excluding special characters from your keywords, the simplest way around this might just be to add a bit more to your search pattern. The following should prevent you from matching text that's already part of an HTML tag:

highlighted = highlighted.replaceAll("(?i)[^<](" + token + ")", "<b>$1</b>");

answered Sep 27, 2013 at 2:08

Josh

1 Comment

susmit shukla Over a year ago

This gave me the idea of using lookbehind, although it doesn't work completely.
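One way to see where the pattern in this answer falls short, and why a lookbehind helps, is to run both on the question's sample text. This is a minimal sketch: the class and method names are hypothetical, and the token "a" is chosen only for illustration. The [^<] in the answer's pattern consumes the character before the keyword, so that character is dropped from the output and a keyword at the very start of the text cannot match. A negative lookbehind only inspects the previous character.

```java
// Hypothetical demo class, for illustration only.
public class ConsumedPrefixDemo {
    // Pattern from the answer above: [^<] consumes the character before the keyword.
    static String consuming(String token, String text) {
        return text.replaceAll("(?i)[^<](" + token + ")", "<b>$1</b>");
    }

    // Variant with a negative lookbehind, which checks the previous character without consuming it.
    static String lookbehind(String token, String text) {
        return text.replaceAll("(?i)(?<!<)(" + token + ")", "<b>$1</b>");
    }

    public static void main(String[] args) {
        System.out.println(consuming("a", "abacus"));  // a<b>a</b>cus  (first "a" missed, "b" dropped)
        System.out.println(lookbehind("a", "abacus")); // <b>a</b>b<b>a</b>cus
    }
}
```

The lookbehind variant still only protects a keyword that directly follows "<", so it is a partial fix at best.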

susmit shukla · Accepted Answer · 2013-09-27 18:53:06Z

1

This code worked for me with minimal changes, using regex lookbehind:

highlighted = highlighted.replaceAll("(?i)((?<!<)(?<!/)" + token + "(?<!>))", "<b>$1</b>");

answered Sep 27, 2013 at 18:53

susmit shukla

1 Comment

Vlad Over a year ago

I don't think you'll like what that'd do to "</table>abacus"
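To make the comment above concrete, the lookbehind pattern from this answer can be run against that input. This is a small sketch, with a hypothetical class name and the token "a" picked for illustration. The lookbehinds only look at the single character before the keyword, so they protect <b> and </b> but not a keyword that occurs deeper inside a longer tag name.

```java
// Hypothetical demo class, for illustration only.
public class LookbehindDemo {
    static String highlight(String token, String text) {
        // Pattern copied from the answer above.
        return text.replaceAll("(?i)((?<!<)(?<!/)" + token + "(?<!>))", "<b>$1</b>");
    }

    public static void main(String[] args) {
        // The inserted bold tags themselves are left alone:
        System.out.println(highlight("b", "<b>a</b>"));       // <b>a</b>
        // ...but the "a" inside the closing table tag is wrapped too:
        System.out.println(highlight("a", "</table>abacus")); // </t<b>a</b>ble><b>a</b>b<b>a</b>cus
    }
}
```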
