This is specifically aimed at parsing hex bytes, but there's a more general question here.

Suppose I have a regexp r e.g. \\s*([0-9A-Fa-f]{2})\\s* (optional spaces, 2 hex digits that I'm interested in, and optional spaces).

If I want to parse a string s with this regexp such that:

  • If s can be divided into a sequence of blocks that each match r, I want to do something for each block. (e.g. ff 7c 0903 02BB aC could be divided in this way.)

  • If s cannot be divided accordingly, I want to detect this. (e.g. 00 01 02 hi there ab ff and 9 0 2 1 0 and Y0 DEADBEEF and cafe BABE! all fail.)

how could I do this with Java's regexp facilities?

3 Answers

3

I believe this is a use case for java.util.Scanner. You could use either hasNext(String) or hasNext(Pattern) to discover whether the next token matches your regex, and then next(String) or next(Pattern) to consume it.

I don't have a compiler handy, but I think it would go something like this:

Scanner myScanner = new Scanner(mySource);
// default delimiter is any whitespace, so you don't need to call useDelimiter()
Pattern myPattern = Pattern.compile("\\s*([0-9A-Fa-f]{2})\\s*");
// next(Pattern) throws an exception instead of returning null when the
// next token doesn't match, so test with hasNext(Pattern) first
while (myScanner.hasNext(myPattern)) {
    String s = myScanner.next(myPattern);
    // do something with the token
}

2 Comments

interesting, ok, how can I make sure there's no non-matching input before/after/between tokens?
Hmm... it's been a while, but I think you'd have to try hasNext() and skip().
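
To make the "no non-matching input" check concrete, here is a minimal sketch. It relies on hasNext(Pattern) and a final hasNext() rather than skip(), and the class and method names (ScannerHexSketch, parse) are made up for illustration. One caveat under these assumptions: with the default whitespace delimiter, run-together input such as 0903 is a single four-character token, so this version rejects it even though the question wants it accepted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class ScannerHexSketch {
    // A whitespace-delimited token must be exactly two hex digits.
    private static final Pattern HEX_BYTE = Pattern.compile("[0-9A-Fa-f]{2}");

    /** Returns the hex tokens, or Optional.empty() if any token is not a hex byte. */
    static Optional<List<String>> parse(String input) {
        List<String> bytes = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            // hasNext(Pattern) is true only if the whole next token matches.
            while (scanner.hasNext(HEX_BYTE)) {
                bytes.add(scanner.next(HEX_BYTE));
            }
            // A leftover token means something other than a hex byte was found.
            return scanner.hasNext() ? Optional.empty() : Optional.of(bytes);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("ff 7c 02 aC"));       // all tokens accepted
        System.out.println(parse("00 01 02 hi there")); // rejected at "hi"
        System.out.println(parse("0903"));              // rejected: one long token
    }
}
```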
2

Another option would be to use the regex matcher stuff and the lookingAt() method.

Something like:

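Pattern p = Pattern.compile( "\\s*([0-9A-Fa-f]{2})" );
Matcher m = p.matcher( myString );
int lastEnd = 0;
while( m.lookingAt() ) {
    System.out.println( "Hex part:" + m.group(1) );
    lastEnd = m.end();
    // lookingAt() only matches at the start of the matcher's region, so move
    // the region past this match; otherwise the loop re-matches the first byte forever
    m.region( lastEnd, myString.length() );
}
if( lastEnd < myString.length() ) {
    System.err.println( "Encountered non-hex value at index:" + lastEnd );
}

...or whatever. lookingAt() has to match at the start of the matcher's region, so as long as the region is moved past each match, the matches must all be contiguous. The only error condition to catch is finishing early, since that means non-hex-formatted data was encountered. (Note that trailing whitespace after the last byte also counts as finishing early with this pattern.)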

3 Comments

neat! I ended up doing this approach manually (checking the previous end() vs. the current start()), didn't know about lookingAt().
That's not right. lookingAt() only matches at the beginning of the Matcher's region, which is the beginning of the string by default. You could make this approach work by constantly changing the starting bound of the region, but it's much easier just to prepend \G to the regex and use find(). As it is, your code just keeps matching the first two hex digits in an infinite loop (if it matches anything, that is).
He's right. The code I've done that used lookingAt() for similar purpose was also chopping the string up each time... which is another option. myString = myString.substring(lastEnd) is nearly free. I forgot to put it.
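
The \G suggestion from the comment above can be sketched as follows. This is only an illustration under stated assumptions: the class name ContiguousHexSketch, the parse method, and its "return the failing index or -1" convention are all invented here. \G anchors each find() at the end of the previous match, so any gap between blocks stops the loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class ContiguousHexSketch {
    // \G pins each match to the end of the previous one (or the start of input).
    private static final Pattern HEX_BYTE = Pattern.compile("\\G\\s*([0-9A-Fa-f]{2})");

    /**
     * Appends each hex byte to out. Returns -1 if the whole input parsed,
     * otherwise the index where contiguous matching stopped.
     */
    static int parse(String input, List<String> out) {
        Matcher m = HEX_BYTE.matcher(input);
        int lastEnd = 0;
        while (m.find()) {
            out.add(m.group(1));
            lastEnd = m.end();
        }
        // Trailing whitespace is fine; anything else left over is an error.
        return input.substring(lastEnd).matches("\\s*") ? -1 : lastEnd;
    }

    public static void main(String[] args) {
        List<String> bytes = new ArrayList<>();
        int errorAt = parse("ff 7c 0903 02BB aC", bytes);
        System.out.println(bytes + " errorAt=" + errorAt);
    }
}
```

Compared with lookingAt(), this needs neither region() calls nor substring() chopping inside the loop, because find() resumes where the previous match ended and \G rejects any match that starts elsewhere.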
2

You can check the complete input by adding anchors, or by using matches() instead of contains(). With anchors, the regexp becomes:

^(\\s*([0-9A-Fa-f]{2}))+\\s*$

If this regexp matches, you can then proceed and iterate over the matches for:

\\s*([0-9A-Fa-f]{2})

to pick up the hex bytes.

3 Comments

wasn't planning on using 2 regexps, but this is certainly simple + straightforward.
This is the best answer so far, but the other method you're thinking of is Matcher#find(); contains() is a String method that just does a literal text search.
@Alan, thanks for your remark, I was referring to the Jakarta ORO methods matches and contains.
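
In java.util.regex terms, the two-regexp approach might look like the sketch below: Matcher#matches() validates the whole input (which makes the ^ and $ anchors unnecessary), and Matcher#find() then iterates over the bytes. The class and method names (TwoPassHexSketch, parse) are placeholders chosen for this example. Because the validation pattern uses +, an empty or all-whitespace string is rejected here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class TwoPassHexSketch {
    // Pass 1: the entire input must be one or more hex bytes with optional spaces.
    private static final Pattern WHOLE = Pattern.compile("(\\s*[0-9A-Fa-f]{2})+\\s*");
    // Pass 2: pick out the individual bytes.
    private static final Pattern HEX_BYTE = Pattern.compile("\\s*([0-9A-Fa-f]{2})");

    /** Returns the hex bytes, or Optional.empty() if the input is not valid. */
    static Optional<List<String>> parse(String input) {
        // matches() succeeds only if the pattern covers the whole string.
        if (!WHOLE.matcher(input).matches()) {
            return Optional.empty();
        }
        List<String> bytes = new ArrayList<>();
        Matcher m = HEX_BYTE.matcher(input);
        // Safe to use plain find() here: validation already ruled out gaps.
        while (m.find()) {
            bytes.add(m.group(1));
        }
        return Optional.of(bytes);
    }

    public static void main(String[] args) {
        System.out.println(parse("ff 7c 0903 02BB aC")); // accepted
        System.out.println(parse("cafe BABE!"));         // rejected
    }
}
```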
