I want to achieve what the String.split() method in Java does, but without calling split().

For example, calling "comma,separated,values".split(",", -1) gives me 3 values. I want to get the same result using a regular expression with Pattern and Matcher instead of calling split().

The separator is arbitrary (it can consist of multiple characters), and the values being split are arbitrary as well.

I've tried searching previous StackOverflow answers, and people suggest using negative lookahead, but I haven't found the exact regex that works for arbitrary separator and values.

Thanks in advance.

PS: the 'arbitrary' separator is known when the regex is about to be constructed. It is passed to the regex builder as a parameter. It could be a comma, a pipe, or a combination of several characters, like pipe-tilde-pipe.

I understand that split() itself accepts a regex as its parameter. To make it clear, what I want is something like this:

Pattern pattern = Pattern.compile("<the regex pattern>");
Matcher matcher = pattern.matcher("comma,separated,values");
while (matcher.find()) {
  System.out.println(matcher.group());
}

And the following will be printed:

  • comma
  • separated
  • values
  • You might be surprised, but you already are using a regex inside split. "," is a regex matching a comma. Just use the appropriate regex there to get your "arbitrary" values. Commented Oct 30, 2017 at 12:00
  • The String.split() source code is openly available. Commented Oct 30, 2017 at 12:30
  • Yes, but regex in split is to identify the comma, like what you said. What I want to identify are the values, not the comma. BTW, the arbitrary separator is already known when the regex is about to be constructed. It is passed as parameter into the regex builder. The separator could be a comma, a pipe, or even combination of characters like pipe-tilde-pipe. Commented Oct 30, 2017 at 16:01

1 Answer

A value consists of any characters and ends as soon as a sep is found or the line itself ends. Use a lazy *? so you only go to the first sep (else you'll run to EOL and match the $ on the first go):

.*?(<sep>|$)

A value does not contain the separator, so use a capturing group to get at it alone:

(.*?)(<sep>|$)

When iterating over matches, each find() consumes a value, places it in capture group 1 (access it with group(1)), and then consumes the separator, so the next call to find() picks up the next value and the next separator. sep can be any regex; if you want to split on a plain string, escape it first (for example with Pattern.quote()). If you use sep as a regex, be careful about greediness. Note that after the last value, find() succeeds once more with an empty match at the end of the input, as the final output line in the example below shows.
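
For the plain-string case from the question, the capture-group pattern can be wrapped in a small helper. This is only a sketch: the class name FindSplit and the method splitLiteral are made up for illustration, the separator is assumed to be a non-empty literal string, and the input is assumed to contain no line terminators (both . and $ treat them specially). It stops as soon as the terminator group is empty, because that means $ matched and the last field has been read; otherwise find() would report one extra empty field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindSplit {
    // Hypothetical helper: splits input on a literal separator using find() and group(1).
    // Assumes literalSep is non-empty and input has no line terminators.
    static List<String> splitLiteral(String input, String literalSep) {
        // Pattern.quote() makes the separator match literally, so "|~|" needs no manual escaping.
        Pattern pat = Pattern.compile("(.*?)(" + Pattern.quote(literalSep) + "|$)");
        Matcher mat = pat.matcher(input);
        List<String> fields = new ArrayList<>();
        while (mat.find()) {
            fields.add(mat.group(1));
            // An empty terminator means "$" matched: this was the last field,
            // so stop before find() yields one more empty match at the end.
            if (mat.group(2).isEmpty()) {
                break;
            }
        }
        return fields;
    }
}
```

Under those assumptions, FindSplit.splitLiteral("comma,separated,values", ",") yields the three values from the question, and a trailing separator still produces a trailing empty field, as split(",", -1) does.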

As an example: set sep = ([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+, which matches numbers divisible by 3. Then

Pattern pat = Pattern.compile("(.*?)(([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+|$)");
Matcher mat = pat.matcher("a333521c63");
while(mat.find()) {
    System.out.println("Field: " + mat.group(1) + "; Terminated by: " + mat.group(2));
}

prints

Field: a; Terminated by: 333
Field: 5; Terminated by: 21
Field: c; Terminated by: 63
Field: ; Terminated by:

Note that if you must use group() (aka group(0)) instead of group(1), then you must use lookarounds, which results in this regex

(?<=<sep>|^).*?(?=<sep>|$)

Because sep is inside a lookbehind, it cannot contain +, *, or {n,}: the Java regex engine cannot handle lookbehinds of potentially infinite size (to be fair, most other engines are even more restrictive). It works in your simple use cases, with commas and fixed strings:

Commas: (?<=,|^).*?(?=,|$)
|~|   : (?<=\|~\||^).*?(?=\|~\||$)

It even works with a bounded quantifier such as {0,10}:

Snake : (?<=s{0,10}!|^).*?(?=s{0,10}!|$)

But it won't work for numbers-divisible-by-3:

Div-by-3: (?<=([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+|^).*?(?=([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+|$)
                                  ^ ERROR! * in lookbehind

Examples:

Pattern asSeparator(String sep) {
    return Pattern.compile("(?<=(" + sep + ")|^).*?(?=(" + sep + ")|$)");
}
String[] seps = { ","
                , "\\|~\\|"
                , "s{0,10}!"
                , "([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+"
                };
for(String sep : seps) {
    System.out.println("Separator: " + sep);
    Pattern pat = asSeparator(sep);
    Matcher mat = pat.matcher("a3a,|~|, sssss!6");
    while(mat.find()) {
        System.out.println(mat.group());
    }
    System.out.println();
}

Out:

Separator: ,
a3a
|~|
 sssss!6

Separator: \|~\|
a3a,
, sssss!6

Separator: s{0,10}!
a3a,|~|, 
6

Separator: ([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+
Exception in thread "main" java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 98
(?<=(([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+)|^).*?(?=(([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+)|$)
                                                                                                  ^
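
If the separator is always a plain string, as in the question, the lookaround pattern can be built with Pattern.quote() so that group() returns each value directly. The following is a sketch under assumptions rather than a drop-in replacement for split(): the class name LookaroundSplit and its methods are invented for illustration, the separator is assumed to be non-empty, the input is assumed to contain no line terminators, and separator occurrences are assumed not to overlap one another in the input (with overlaps such as "|~|~|", the lookbehind can see a separator that split() would not).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundSplit {
    // Hypothetical builder: a pattern whose whole match (group()) is one value.
    static Pattern valuePattern(String literalSep) {
        // A quoted literal has a fixed length, so it is allowed inside the lookbehind.
        String sep = Pattern.quote(literalSep);
        return Pattern.compile("(?<=" + sep + "|^).*?(?=" + sep + "|$)");
    }

    // Collects every match of the value pattern, in order.
    static List<String> values(String input, String literalSep) {
        Matcher mat = valuePattern(literalSep).matcher(input);
        List<String> out = new ArrayList<>();
        while (mat.find()) {
            out.add(mat.group());
        }
        return out;
    }
}
```

For example, LookaroundSplit.values("comma,separated,values", ",") returns comma, separated and values, and empty fields between or after separators are kept.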