I'm using a java.util.Scanner to scan all occurrences of a given regex from a big string.

Scanner sc = new Scanner(body);
sc.useDelimiter("");
String match = "";
while(match!=null)
{
    match = sc.findWithinHorizon(pattern, 0);
    if(match==null)break;
    MatchResult mr = sc.match();
    System.out.println("Match string: "+mr.group());
    System.out.println("Match string using indexes: "+body.substring(mr.start(),mr.end()));
}

The strange thing is that after a certain number of scans, the group() method still returns the correct occurrence, but the start() and end() methods return wrong indexes, as if the scan had restarted from the beginning of the file. The regex is multiline (I use it to detect line breaks: "\r\n|[\n\r\u2028\u2029\u0085]").

Do you have any hints? Could it be related to the "horizon" parameter? I've tried different values for it.

More details: it seems related to the size of the input (more than 1000 chars). After about 1000 chars the indexes restart from 0 (e.g. the first wrong occurrence, after 1003:1020, is reported as 3:120).

1 Answer

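Scanner reads its input through an internal buffer of 1024 characters, and the indexes reported by sc.match() are relative to that buffer, not to the original string. That is why they appear to restart once you get past roughly 1000 characters. Since the whole input is already in memory as a String, use Pattern and Matcher directly instead: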

Matcher matcher = Pattern.compile(...).matcher(body);
while(matcher.find()) {
    int start = matcher.start();
}
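
For completeness, here is a minimal self-contained sketch of the same approach. It assumes the line-break regex from the question; the class name MatchOffsets, the helper findOffsets and the generated test input are made up for illustration. The regex is written with doubled backslashes so that the regex engine, not the Java compiler, interprets the escapes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchOffsets {
    // Line-break regex from the question (escapes left to the regex engine).
    static final Pattern LINE_BREAK =
            Pattern.compile("\\r\\n|[\\n\\r\\u2028\\u2029\\u0085]");

    // Hypothetical helper: collects {start, end} for every match.
    // Matcher works on the whole CharSequence, so the offsets are
    // always relative to the start of body.
    static List<int[]> findOffsets(Pattern pattern, CharSequence body) {
        List<int[]> offsets = new ArrayList<>();
        Matcher matcher = pattern.matcher(body);
        while (matcher.find()) {
            offsets.add(new int[] {matcher.start(), matcher.end()});
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Build an input well beyond 1024 characters.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 300; i++) {
            sb.append("line ").append(i).append("\r\n");
        }
        String body = sb.toString();

        List<int[]> offsets = findOffsets(LINE_BREAK, body);
        for (int[] span : offsets) {
            // Each span must cut exactly the matched text out of body.
            if (!body.substring(span[0], span[1]).equals("\r\n")) {
                throw new IllegalStateException("Bad offsets at " + span[0]);
            }
        }
        System.out.println("Matches found: " + offsets.size());
    }
}
```

With this, body.substring(start, end) gives back the matched text no matter how long the input is, because no intermediate buffer is involved.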