A value consists of any characters and ends as soon as a separator (<sep>) is found or the line itself ends. Use the lazy quantifier *? so the match stops at the first separator (a greedy .* would run to the end of the line and match the $ on the first go):
.*?(<sep>|$)
A value does not contain the separator, so use a capturing group to get at it alone:
(.*?)(<sep>|$)
When iterating over matches, each find() consumes a value, places it in capture group 1 (access it with group(1)), and then consumes the separator, so the next call to find() picks up the next value and the next separator. sep can be any regex; if you want to split on a plain string, escape it first (for example with Pattern.quote) so that none of its characters are treated as regex syntax. If sep is itself a regex, watch the greediness of its quantifiers, since that decides how much text is consumed as the separator.
As an example: set sep = ([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+, which matches numbers divisible by 3. Then
// Group 1 is the field; group 2 is the separator (or empty at end of line).
Pattern pat = Pattern.compile("(.*?)(([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+|$)");
Matcher mat = pat.matcher("a333521c63");
while (mat.find()) {
    System.out.println("Field: " + mat.group(1) + "; Terminated by: " + mat.group(2));
}
prints
Field: a; Terminated by: 333
Field: 5; Terminated by: 21
Field: c; Terminated by: 63
Field: ; Terminated by:
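For the plain-string case mentioned above, here is a minimal sketch of the same capture-group approach with the separator escaped via Pattern.quote. The class name LiteralSeparatorDemo and the helper fields are made up for illustration. Stopping once group(2) is empty is my own choice, not part of the pattern itself: without it, an input that does not end in a separator yields one extra empty match at the very end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralSeparatorDemo {
    // Hypothetical helper: collects the fields of input, split on a literal (non-regex) separator.
    static List<String> fields(String input, String literalSep) {
        // Pattern.quote wraps the text in \Q...\E, so characters such as | lose their regex meaning.
        Pattern pat = Pattern.compile("(.*?)(" + Pattern.quote(literalSep) + "|$)");
        Matcher mat = pat.matcher(input);
        List<String> out = new ArrayList<>();
        while (mat.find()) {
            out.add(mat.group(1));
            // An empty group 2 means the field was terminated by the end of the line, not by a separator.
            if (mat.group(2).isEmpty()) {
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String field : fields("a3a|~|b,c|~|d", "|~|")) {
            System.out.println("Field: " + field);
        }
    }
}
```

An input that ends in the separator still produces a final empty field, which matches the last line of the output above.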
Note that if you must use group() (aka group(0)) instead of group(1), then you must use lookarounds, which results in this regex:
(?<=<sep>|^).*?(?=<sep>|$)
Because sep now sits inside a lookbehind, it cannot contain +, *, or {n,}: the Java regex engine cannot handle lookbehinds of potentially unbounded length (to be fair, most other engines are even more restrictive). It works in your simple use cases, with commas and fixed strings:
Commas: (?<=,|^).*?(?=,|$)
|~| : (?<=\|~\||^).*?(?=\|~\||$)
It even works in this:
Snake : (?<=s{0,10}!|^).*?(?=s{0,10}!|$)
But it won't work for numbers-divisible-by-3:
Div-by-3: (?<=([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+|^).*?(?=([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+|$)
ERROR! The lookbehind contains * (and +)
Examples:
// Builds a pattern whose whole match (group()) is the field between two separators.
Pattern asSeparator(String sep) {
    return Pattern.compile("(?<=(" + sep + ")|^).*?(?=(" + sep + ")|$)");
}

String[] seps = {
    ",",
    "\\|~\\|",
    "s{0,10}!",
    // Contains unbounded + and *, so it is rejected once placed in the lookbehind.
    "([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+"
};
for (String sep : seps) {
    System.out.println("Separator: " + sep);
    Pattern pat = asSeparator(sep);
    Matcher mat = pat.matcher("a3a,|~|, sssss!6");
    while (mat.find()) {
        System.out.println(mat.group());
    }
    System.out.println();
}
Out:
Separator: ,
a3a
|~|
sssss!6
Separator: \|~\|
a3a,
, sssss!6
Separator: s{0,10}!
a3a,|~|,
6
Separator: ([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+
Exception in thread "main" java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 98
(?<=(([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+)|^).*?(?=(([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+)|$)
^
Regarding split: its argument "," is a regex matching a comma. Just use the appropriate regex there to get your "arbitrary" values.
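As a rough illustration of that point, the sketch below passes the separators used earlier straight to String.split. The class name SplitDemo and the helper splitOn are made up for this example. The limit of -1 is my addition: by default split drops trailing empty strings, and -1 keeps them so the result lines up with the find() loop shown earlier.

```java
import java.util.Arrays;

public class SplitDemo {
    // The divisible-by-3 separator from the example above.
    static final String DIV3 =
        "([0369]|([258][0369]*[147])|([147]|[258]{2})([0369]|([147][0369]*[258]))*([258]|[147]{2}))+";

    // Hypothetical helper: the separator is a regex; limit -1 keeps trailing empty fields.
    static String[] splitOn(String input, String sepRegex) {
        return input.split(sepRegex, -1);
    }

    public static void main(String[] args) {
        String input = "a3a,|~|, sssss!6";
        System.out.println(Arrays.toString(splitOn(input, ",")));
        System.out.println(Arrays.toString(splitOn(input, "\\|~\\|")));
        System.out.println(Arrays.toString(splitOn(input, "s{0,10}!")));
        // No lookbehind is involved here, so the unbounded separator is accepted.
        System.out.println(Arrays.toString(splitOn("a333521c63", DIV3)));
    }
}
```

Unlike the find() loop, split gives you only the fields, not the text of each separator.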