java: using regex to parse repeated substrings

Question

This is specifically aimed at parsing hex bytes, but there's a more general question here.

Suppose I have a regexp r e.g. \\s*([0-9A-Fa-f]{2})\\s* (optional spaces, 2 hex digits that I'm interested in, and optional spaces).

If I want to parse a string s with this regexp such that:

If s can be divided into a sequence of blocks that each match r, I want to do something for each block. (e.g. "ff 7c 0903 02BB aC" could be divided in this way.)
If s cannot be divided accordingly, I want to detect this. (e.g. "00 01 02 hi there ab ff" and "9 0 2 1 0" and "Y0 DEADBEEF" and "cafe BABE!" all fail.)

how could I do this with Java's regexp facilities?

Michael Myers · Accepted Answer · 2009-12-30 20:40:53Z

3

I believe this is a use case for java.util.Scanner. You could use either next(String) or next(Pattern) to discover whether the next token matched your regex.

I don't have a compiler handy, but I think it would go something like this:

Scanner myScanner = new Scanner(mySource);
// default delimiter is any whitespace, so you don't need to call useDelimiter()
Pattern myPattern = Pattern.compile("\\s*([0-9A-Fa-f]{2})\\s*");
// next(Pattern) throws instead of returning null, so test with hasNext(Pattern) first
while (myScanner.hasNext(myPattern)) {
    String s = myScanner.next(myPattern);
    // do something with the token
}

2 Comments

Jason S Over a year ago

interesting, ok, how can I make sure there's no non-matching input before/after/between tokens?

Michael Myers Over a year ago

Hmm... it's been a while, but I think you'd have to try hasNext() and skip().
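
For what it's worth, here is a minimal sketch of the hasNext() idea, without skip(). The class name, the parse method and the TOKEN pattern are made up for illustration and are not from the answer. It assumes that a whitespace-delimited token may hold several byte pairs (so that 0903 is accepted), which is why the pattern differs from the one above. Any token left over after the loop is treated as non-matching input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class ScannerHexParser {
    // Hypothetical pattern: one token is one or more complete hex byte pairs.
    private static final Pattern TOKEN = Pattern.compile("(?:[0-9A-Fa-f]{2})+");

    /** Returns the parsed byte values, or null if some token is not made of hex pairs. */
    static List<Integer> parse(String source) {
        List<Integer> bytes = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            // hasNext(Pattern) is true only if the whole next token matches.
            while (scanner.hasNext(TOKEN)) {
                String token = scanner.next(TOKEN);
                for (int i = 0; i < token.length(); i += 2) {
                    bytes.add(Integer.parseInt(token.substring(i, i + 2), 16));
                }
            }
            // A leftover token means the loop stopped at something that is not hex.
            if (scanner.hasNext()) {
                return null;
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(parse("ff 7c 0903 02BB aC"));      // [255, 124, 9, 3, 2, 187, 172]
        System.out.println(parse("00 01 02 hi there ab ff")); // null
    }
}
```

Note that this sketch accepts empty or all-whitespace input and returns an empty list for it; add a check if that should count as a failure.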

PSpeed · Accepted Answer · 2009-12-30 21:31:11Z

2

Another option would be to use the regex matcher stuff and the lookingAt() method.

Something like:

Pattern p = Pattern.compile( "\\s*([0-9A-Fa-f]{2})" );
Matcher m = p.matcher( myString );
int lastEnd = 0;
while( m.lookingAt() ) {
    System.out.println( "Hex part:" + m.group(1) );
    lastEnd = m.end();
}   
if( lastEnd < myString.length() ) {
    System.err.println( "Encountered non-hex value at index:" + lastEnd );
}

...or whatever. lookingAt() has to start at the current position and so the matches must all be contiguous. The only error condition to catch is finishing early since that means non-hex-formatted data was encountered.

3 Comments

Jason S Over a year ago

neat! I ended up doing this approach manually (checking the previous end() vs. the current start()), didn't know about lookingAt().

Alan Moore Over a year ago

That's not right. lookingAt() only matches at the beginning of the Matcher's region, which is the beginning of the string by default. You could make this approach work by constantly changing the starting bound of the region, but it's much easier just to prepend \G to the regex and use find(). As it is, your code just keeps matching the first two hex digits in an infinite loop (if it matches anything, that is).

PSpeed Over a year ago

He's right. The code I've done that used lookingAt() for similar purpose was also chopping the string up each time... which is another option. myString = myString.substring(lastEnd) is nearly free. I forgot to put it.
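
A minimal sketch of the \G plus find() variant suggested in the comment above, assuming the same hex-pair pattern as the answer. The class and method names are hypothetical, and throwing IllegalArgumentException is just one way to report the failure. \G anchors each match at the end of the previous one, so find() stops at the first position where the matches are no longer contiguous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContiguousHexParser {
    // \G: match only where the previous match ended (or at the start of the input).
    private static final Pattern HEX_BYTE = Pattern.compile("\\G\\s*([0-9A-Fa-f]{2})");

    /** Returns the parsed byte values, or throws if non-hex input is encountered. */
    static List<Integer> parse(String input) {
        List<Integer> bytes = new ArrayList<>();
        Matcher m = HEX_BYTE.matcher(input);
        int lastEnd = 0;
        while (m.find()) {
            bytes.add(Integer.parseInt(m.group(1), 16));
            lastEnd = m.end();
        }
        // Only trailing whitespace may remain after the last contiguous match.
        if (!input.substring(lastEnd).matches("\\s*")) {
            throw new IllegalArgumentException(
                    "Non-hex input after index " + lastEnd);
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(parse("ff 7c 0903 02BB aC")); // [255, 124, 9, 3, 2, 187, 172]
        try {
            parse("cafe BABE!");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The reported index is the end of the last good match, so it may point at whitespace just before the offending characters.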

rsp · Accepted Answer · 2009-12-30 21:32:39Z

2

You can check the complete input by adding anchors, or by using matches() instead of contains(). The regexp becomes:

^(\\s*([0-9A-Fa-f]{2}))+\\s*$

If this regexp matches, you can then proceed and iterate over the matches for:

\\s*([0-9A-Fa-f]{2})

to pick up the hex bytes.
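
A rough sketch of this two-step approach with java.util.regex rather than Jakarta ORO. The class and method names are hypothetical. Matcher#matches() already requires the whole input to match, so the ^ and $ anchors are left out here; that is an assumption about how you would port the pattern, not something the answer states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ValidateThenExtract {
    // Step 1: the whole input must be hex pairs separated by optional whitespace.
    private static final Pattern WHOLE =
            Pattern.compile("(?:\\s*[0-9A-Fa-f]{2})+\\s*");
    // Step 2: the per-byte pattern from the answer.
    private static final Pattern HEX_BYTE =
            Pattern.compile("\\s*([0-9A-Fa-f]{2})");

    /** Returns the parsed byte values, or null if the input as a whole is invalid. */
    static List<Integer> parse(String input) {
        if (!WHOLE.matcher(input).matches()) {
            return null;
        }
        List<Integer> bytes = new ArrayList<>();
        Matcher m = HEX_BYTE.matcher(input);
        // Validation passed, so every find() lands on a well-formed pair.
        while (m.find()) {
            bytes.add(Integer.parseInt(m.group(1), 16));
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(parse("ff 7c 0903 02BB aC")); // [255, 124, 9, 3, 2, 187, 172]
        System.out.println(parse("Y0 DEADBEEF"));        // null
    }
}
```

Because of the + in the first pattern, empty input is rejected here.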

3 Comments

Jason S Over a year ago

wasn't planning on using 2 regexps, but this is certainly simple + straightforward.

Alan Moore Over a year ago

This is the best answer so far, but the other method you're thinking of is Matcher#find(); contains() is a String method that just does a literal text search.

rsp Over a year ago

@Alan, thanks for your remark, I was referring to the Jakarta ORO methods matches and contains.
