Parsing text using Regex

Question

So I am trying to parse a String that contains two key components. One tells me the timing options, and the other is position.

Here is what the text looks like

KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif

The {iiii} is the position and the {ttt} is the timing options.

I need to separate the {ttt} and {iiii} out so I can get a full file name: example, position 1 and time slice 1 = KB_H9Oct4GFP_20130305_p0000001t000000001z001c02.tif

So far here is how I am parsing them:

    int startTimeSlice = 1;
    int startTile = 1;
    String regexTime = "([^{]*)\\{([t]+)\\}(.*)";
    Pattern patternTime = Pattern.compile(regexTime);       
    Matcher matcherTime = patternTime.matcher(filePattern);

    if (!matcherTime.find() || matcherTime.groupCount() != 3)
    {

        throw new IllegalArgumentException("Incorrect filePattern: " + filePattern);
    }

    String timePrefix = matcherTime.group(1);
    int tCount = matcherTime.group(2).length();
    String timeSuffix = matcherTime.group(3);

    String timeMatcher = timePrefix + "%0" + tCount + "d" + timeSuffix;


    String timeFileName = String.format(timeMatcher, startTimeSlice);

    String regex = "([^{]*)\\{([i]+)\\}(.*)";
    Pattern pattern = Pattern.compile(regex);       
    Matcher matcher = pattern.matcher(timeFileName);        



    if (!matcher.find() || matcher.groupCount() != 3)
    {
        throw new IllegalArgumentException("Incorrect filePattern: " + filePattern);
    }

    String prefix = matcher.group(1);
    int iCount = matcher.group(2).length();
    String suffix = matcher.group(3);

    String nameMatcher = prefix + "%0" + iCount + "d" + suffix;

    String fileName = String.format(nameMatcher, startTile);

Unfortunately my code is not working and it fails when checking if the second matcher finds anything in timeFileName.

After the first regex check it gets the following as the timeFileName: 000000001z001c02.tif, so it is cutting off the beginning portion, including the {iiii}.

Unfortunately I cannot assume which group comes first ({iiii} or {ttt}), so I am trying to devise a solution that just handles {ttt} first and then processes {iiii}.

Also, here is another example of valid text that I am also trying to parse: F_{iii}_{ttt}.tif

Marsh (Mar 4, 2014 at 20:37): Do they all have the trailing 't' and 'z' characters to differentiate which is which should the order change? Your last example makes it look like the 't' and 'z' may be absent in some cases.
Jameshobbs (Mar 4, 2014 at 20:39): Sadly, the 'z' and 't' are not guaranteed to be present, as the last example F_{iii}_{ttt}.tif shows.
Marsh (Mar 4, 2014 at 20:41): Can you guarantee the ordering when z and t are missing? If not, you'll definitely need some way of differentiating or you will get some incorrect results with the F_{iii}_{ttt}.tif files.
Jameshobbs (Mar 4, 2014 at 20:43): Regex is not required. I basically need to provide an easy tool for people to input text that indicates how many digits are in the position and how many digits are in the time slice, as well as which placeholder is the position and which is the time slice.

Braj · Accepted Answer · 2014-03-04 21:15:35Z

Steps to follow:

1. Find the string {ttt...} in the file name.
2. Build a number format based on the number of "t" characters in that string.
3. Find the string {iiii...} in the file name.
4. Build a number format based on the number of "i" characters in that string.
5. Use the String.replace() method to substitute the time and the position.

Here is the code:

import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileNameDemo {
    public static void main(String[] args) {
        String filePattern = "KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif";
        int startTimeSlice = 1;
        int startTile = 1;

        // t+ rather than t*, so that an empty "{}" is not treated as a placeholder
        Pattern patternTime = Pattern.compile("(\\{t+\\})");
        Matcher matcherTime = patternTime.matcher(filePattern);

        if (matcherTime.find()) {
            String timePattern = matcherTime.group(0); // {ttt}

            // Turn "{ttt}" into the DecimalFormat pattern "000"
            NumberFormat timingFormat = new DecimalFormat(timePattern.replaceAll("t", "0")
                    .substring(1, timePattern.length() - 1)); // 000

            Pattern patternPosition = Pattern.compile("(\\{i+\\})");
            Matcher matcherPosition = patternPosition.matcher(filePattern);

            if (matcherPosition.find()) {
                String positionPattern = matcherPosition.group(0); // {iiii}

                // Turn "{iiii}" into the DecimalFormat pattern "0000"
                NumberFormat positionFormat = new DecimalFormat(positionPattern
                        .replaceAll("i", "0").substring(1, positionPattern.length() - 1)); // 0000

                // Replace each placeholder with its zero-padded number
                System.out.println(filePattern.replace(timePattern,
                        timingFormat.format(startTimeSlice)).replace(positionPattern,
                        positionFormat.format(startTile)));
            }
        }
    }
}
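
The steps above can also be folded into a single pass that does not care which placeholder comes first. The following is only a sketch of that idea, not part of the original answer: the class name PlaceholderFiller and the method fill are made up for illustration, and it assumes the only placeholders are runs of i or t inside braces.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderFiller {
    // Matches either {iii...} or {ttt...}; group 1 holds the run of letters.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(i+|t+)\\}");

    // Hypothetical helper: replaces every placeholder in one left-to-right pass.
    static String fill(String filePattern, int tile, int timeSlice) {
        Matcher m = PLACEHOLDER.matcher(filePattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String letters = m.group(1);
            // 'i' placeholders take the tile (position), 't' placeholders the time slice.
            int value = letters.charAt(0) == 'i' ? tile : timeSlice;
            // Zero-pad to the number of letters inside the braces.
            String digits = String.format(Locale.ROOT, "%0" + letters.length() + "d", value);
            m.appendReplacement(out, Matcher.quoteReplacement(digits));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fill("KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif", 1, 1));
        System.out.println(fill("F_{iii}_{ttt}.tif", 1, 1));
    }
}
```

Because the loop decides per match whether it is looking at a position or a time placeholder, a pattern such as F_{ttt}_{iii}.tif is handled the same way as F_{iii}_{ttt}.tif.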

Jameshobbs · Accepted Answer · 2014-03-04 20:55:09Z

Okay, so after a bit of testing I found a way to handle the case:

For parsing the {ttt} I can use the regex: (.*)\\{t([t]+)\\}(.*)

Now this means I have to increment tCount by one to account for the t I grab from \\{t

Same goes for {iii}: (.*)\\{i([i]+)\\}(.*)
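
For reference, here is a small sketch of how the {ttt} half of this could look when written out. The class and method names are invented for the example. Note that, as written, the regex needs at least two t's inside the braces, because one t is consumed outside the capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyPrefixDemo {
    // The regex from this answer: one 't' sits outside group 2, the rest inside it.
    private static final Pattern TIME = Pattern.compile("(.*)\\{t([t]+)\\}(.*)");

    // Hypothetical helper: fills only the {ttt...} placeholder.
    static String fillTime(String filePattern, int timeSlice) {
        Matcher m = TIME.matcher(filePattern);
        if (!m.find()) {
            throw new IllegalArgumentException("Incorrect filePattern: " + filePattern);
        }
        // +1 accounts for the 't' matched by "\\{t" before the group starts.
        int tCount = m.group(2).length() + 1;
        // Format only the number, so a '%' in the file name cannot confuse String.format.
        return m.group(1) + String.format("%0" + tCount + "d", timeSlice) + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(fillTime("KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif", 1));
    }
}
```

The leading (.*) is allowed to contain { characters, which is why the {iiii} part survives into the prefix and is still available for the second pass.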


2 Comments

ajb Over a year ago

Why increment by one? Just move the left parenthesis: (t[t]+), and now it will catch all the t's in the group. Or (t{2,}) matches two or more t's. By the way, there's no reason to put a single character in square brackets, unless you think it's more readable.

Jameshobbs Over a year ago

Indeed you are right. Here is a more final version that includes everything. (.*)(\\{[i]+\\})(.*)

ajb · Accepted Answer · 2014-03-04 20:57:38Z

Your first pattern looks like this:

String regexTime = "([^{]*)\\{([t]+)\\}(.*)";

This finds a string consisting of a sequence of zero or more non-{ characters, followed by {t...t}, followed by other characters.

When your input is

KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif

the first substring that matches is

iiii}t00000{ttt}z001c02.tif

The { before the i's can't match, because you told it only to match non-{ characters. The result is that when you re-form the string to do the second match, it will start with iiii} and therefore won't match {iiii} like you're trying to do.
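
A short program makes this behaviour easy to see for yourself. This is just an illustrative sketch (the class name FirstMatchDemo and the helper prefixOf are invented here); it runs the original pattern with find() and prints what each group captured.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstMatchDemo {
    // The original pattern from the question: the prefix group may not contain '{'.
    static final Pattern ORIGINAL = Pattern.compile("([^{]*)\\{([t]+)\\}(.*)");

    // Returns what group 1 (the "prefix") captured, or null when nothing matches.
    static String prefixOf(String filePattern) {
        Matcher m = ORIGINAL.matcher(filePattern);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String filePattern = "KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif";
        Matcher m = ORIGINAL.matcher(filePattern);
        if (m.find()) {
            // find() is free to start the match anywhere, so it skips past "{".
            System.out.println("match starts at index " + m.start());
            System.out.println("group 1: " + m.group(1)); // iiii}t00000
            System.out.println("group 2: " + m.group(2)); // ttt
            System.out.println("group 3: " + m.group(3)); // z001c02.tif
        }
    }
}
```

Everything before the match start, including the { that opens {iiii}, is simply not part of any group, so it is lost when the string is rebuilt from the groups.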

When you're looking for {ttt...}, I don't see any reason to exclude { or any other character from the first part of the string. So changing the regex to

"^(.*)\\{(t+)\\}(.*)$"

may be a simple way to fix this. Note that if you want to make sure your groups cover the entire beginning and the entire end of the string, you should include ^ and $ to match the beginning and end of the string, respectively; otherwise the regex engine is free to match only part of the input. In this case the greedy (.*) groups already capture everything, but anchoring is a good habit to get into anyway, because it makes the intent explicit and doesn't require anyone to know the difference between "greedy" and "reluctant" matching. Alternatively, use matches() instead of find(), since matches() always tries to match the entire string.
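
Putting the pieces together, a two-pass version along these lines might look like the sketch below. The names AnchoredFill, fillOne and fill are made up for this example, and it assumes the placeholder letter is a plain letter such as i or t, so it needs no regex escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredFill {
    // Hypothetical helper: replaces one {xxx...} placeholder made of the given
    // letter with a zero-padded number. Assumes 'letter' is a plain letter.
    static String fillOne(String text, char letter, int value) {
        Pattern p = Pattern.compile("^(.*)\\{(" + letter + "+)\\}(.*)$");
        Matcher m = p.matcher(text);
        // matches() must consume the whole string, so nothing is silently dropped.
        if (!m.matches()) {
            throw new IllegalArgumentException("No {" + letter + "...} placeholder in: " + text);
        }
        String digits = String.format("%0" + m.group(2).length() + "d", value);
        return m.group(1) + digits + m.group(3);
    }

    // Handles {ttt} first and then {iiii}, as the question describes.
    static String fill(String filePattern, int tile, int timeSlice) {
        return fillOne(fillOne(filePattern, 't', timeSlice), 'i', tile);
    }

    public static void main(String[] args) {
        System.out.println(fill("KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif", 1, 1));
        System.out.println(fill("F_{iii}_{ttt}.tif", 1, 1));
    }
}
```

Since each pass keeps the full prefix and suffix, the order of the two placeholders in the file name does not matter.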

edited Mar 4, 2014 at 20:57

answered Mar 4, 2014 at 20:51


1 Comment

Jameshobbs Over a year ago

Sorry I have a typo in the regex above. I will edit and remove the '. The ' was not in my original code.

Floris · Accepted Answer · 2014-03-04 21:15:57Z

Perhaps an easier way to do this (as confirmed by http://regex101.com/r/vG7kY7) is

(\{i+\}).*(\{t+\})

You don't need the [] around a single character you are matching. Keep it simple. i+ means "one or more i's", and as long as the placeholders appear in the order given, this expression will work (with the first group capturing {iiii} and the second capturing {ttt}).

In a Java string literal each backslash has to be doubled, so the pattern is written as "(\\{i+\\}).*(\\{t+\\})".
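
As a rough sketch of how this single expression could be used from Java (the class name SingleRegexDemo and the method fill are invented for the example), the group offsets can be used to splice the numbers into place. It only works when the {i...} placeholder comes before the {t...} placeholder.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleRegexDemo {
    // Group 1 is the {i...} placeholder, group 2 the {t...} placeholder (braces included).
    private static final Pattern BOTH = Pattern.compile("(\\{i+\\}).*(\\{t+\\})");

    // Hypothetical helper: fills both placeholders using the group offsets.
    static String fill(String filePattern, int tile, int timeSlice) {
        Matcher m = BOTH.matcher(filePattern);
        if (!m.find()) {
            throw new IllegalArgumentException("Expected {i...} followed by {t...}: " + filePattern);
        }
        // Subtract 2 because each group includes its opening and closing brace.
        int iDigits = m.group(1).length() - 2;
        int tDigits = m.group(2).length() - 2;
        return filePattern.substring(0, m.start(1))
                + String.format("%0" + iDigits + "d", tile)
                + filePattern.substring(m.end(1), m.start(2))
                + String.format("%0" + tDigits + "d", timeSlice)
                + filePattern.substring(m.end(2));
    }

    public static void main(String[] args) {
        System.out.println(fill("KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif", 1, 1));
    }
}
```

If the time placeholder can appear first, this pattern finds no match, so one of the order-independent approaches is the safer choice for the F_{iii}_{ttt}.tif style of input when the order is not guaranteed.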

