4

The basic problem is that I need to output all matches of a regex within a file, but there's a few properties that make a solution tricky:

  1. The file could be very large, so the whole file cannot be brought into memory and scanned all at once.
  2. The regex is user-provided, so there is no estimation of how large a match could be
  3. The regex can be multiline, making possible matches even longer, and scanning per-line could cause a match that crosses lines to be missed.
  4. Every match should be returned with its position in the whole file.
  5. Matches should be evaluated lazily (no need to find all matches in the file at once)

As an example for 3, this algorithm

while (not end of file):
  line = file.readNextLine()
  for (each match in regex.find(line)):
    results.append(match)
return results

breaks in the case of

// inline modifier to turn on DOTALL
regex = "(?s)start.*end"
content = "start
end"

because start then end would be presented individually, when the dot can match a newline character.

My current idea is to implement Iterator<MatchResult>, and read into a CharBuffer as a sliding window. java.util.regex.Matcher also has hitEnd(), which can signal if more input would have changed the result of the last find() operation (if the regex is "foobar" and the matcher was evaluated on "foo", hitEnd() would return true. If the matcher was evaluated on "bar" instead, it returns false). If the buffer is full and more input could find a match, the size of the buffer is doubled until a match is found. I could implement Spliterator instead, but I can't tell if that's more or less complicated.

So far I have this basic structure (exception handling omitted for brevity):

public class RegexStreamIterator implements Iterator<MatchResult> {
  private static final int MIN_BUFFER_SIZE_BYTES = 8192;
  private static final int MAX_BUFFER_EXPANSIONS = 4;
  private static final int MAX_BUFFER_SIZE_BYTES = (int) Math.pow(2, MAX_BUFFER_EXPANSIONS) * MIN_BUFFER_SIZE_BYTES;

  private final Reader reader;
  private final Pattern pattern;

  private int readerIndex = 0;
  private Matcher matcher;
  private CharBuffer buffer = CharBuffer.allocate(MIN_BUFFER_SIZE_BYTES);
  private MatchResult currentMatch;

  public RegexStreamIterator(Reader reader, Pattern pattern) {
    this.reader = reader;
    this.pattern = pattern;
  }

  public boolean hasNext() {
    if (currentMatch != null) return true;
    advance();
    return currentMatch != null;
  }

  public MatchResult next() {
    if (currentMatch == null)
      advance();
    if (currentMatch == null)
      throw new NoSuchElementException();

    MatchResult ret = currentMatch;
    currentMatch = null;
    return ret;
  }

  private void advance() {
    // pseudocode while I'm stuck on actual implementation
    if matcher is null
      fill buffer
      if buffer is null
        return
      matcher = pattern.matcher(buffer)
    
    while not end of file:
      if matcher found result and not matcher.hitEnd() // match found and won't grow or get rejected with more input
        this.currentMatch = match
        advance matcher to next match
        return

      if matcher.hitEnd()
        if no match was found in buffer (matcher start index at 0)
          double buffer size 
            if new size bigger than max size, throw exception
          fill from file 
          reset matcher
        else (part of buffer consumed)
          move remaining buffer to new buffer 
          fill from file
          reset matcher
      else
        clear buffer (resetting size)
        fill from file
        reset matcher
  }
}

I'm very stuck on problems like null handling and making a loop that can both seek to the next match and is stable between multiple calls to advance(). This absolutely seems like a problem that has been solved before as well, if common libraries like Apache or Guava have something that cover this that would be fantastic.

9
  • 4
    This sounds like a good fit for Scanner.findAll, which can find regex matches in a character stream. Of course, if you’re using a user-supplied regex, there will always be the possibility of a denial of service attack by supplying a very demanding regex. Commented Apr 16 at 17:21
  • Good find! Having it return a stream might cause all matches to be found at once (since the doc says scanning runs through once a terminal operator is called), but that might still be within acceptable resource bounds. I'll poke around with it and see if I can place controls on match size and number. Commented Apr 16 at 17:27
  • 1
    That merely means that the scanner will not start scanning (reading) until a terminal operator is called, just like a Stream from a list will not do anything until a terminal operator is called. It doesn't mean that the entire contents are stored in memory. Commented Apr 16 at 18:13
  • sidenote(?) : you are aware that, for example, searching a string like "a1ba2b" for a regular expression like "a.*b" will find only one occurrence, in this case the whole string (since .* is matching as much as possible) Commented Apr 16 at 18:20
  • @user85421 yes, that's why checking hitEnd() is needed. If the sliding window presents "a1b" to the matcher, it will return that it found a match and hitEnd() returns true indicating that more input could change the most recent match result. In this case, it's because the .* could change the match from "a1b" to "a1ba2b". Commented Apr 16 at 18:41

1 Answer 1

5

The findAll method of Scanner can do this. It will look for regular expression matches in any character stream, without loading all of the content into memory.

Sign up to request clarification or add additional context in comments.

3 Comments

This function definition has all the right words in it. Assumes a rolling buffer stream, yet it has to retain, or buffer the current possible match pending. Has to backtrack. In essence read the whole file into memory storage equal to the file size, on a lot more regex than you think. In that regard it's not generally useful, even much less effecient.
@sln I don’t know what you mean by “function definition,” but as I said in a comment above, malicious or irresponsible regular expressions can result in a denial of service attack. That’s not a problem with Scanner or with Java; that’s a problem with blindly accepting a user-supplied regex. Every API will have the same problem.
Any function that presumes to match regex within a rolling buffer all make the same assumptions as stated. I base my conclusions on my experience designing regex.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.