I am working on a program that uses regular expressions. It searches text files for player scores and builds a database from them.

Here is a sample of the text it searches:

    ISLAMABAD UNITED 1st innings

Player              Status                        Runs  Blls  4s   6s    S/R
David Warner        lbw            b. Hassan       19    16    4    0  118.8%
Joe Burns                          b. Morkel       73   149   16    0   49.0%
Kane Wiliiamson                    b. Tahir       135   166   28    2   81.3%
Asad Shafiq         c. Rahane      b. Morkel       22    38    5    0   57.9%
Kraigg Braithwaite  c. Khan        b. Boult        24    36    5    0   66.7%
Corey Anderson                     b. Tahir        18    47    3    0   38.3%
Sarfaraz Ahmed                     b. Morkel        0     6    0    0    0.0%
Tim Southee         c. Hales       b. Morkel        0     6    0    0    0.0%
Kyle Abbbott        c. Rahane      b. Morkel       26    35    4    0   74.3%
Steven Finn         c. Hales       b. Hassan       10    45    1    0   22.2%
Yasir Shah          not out                         1    12    0    0    8.3%

 Total:  338/10       Overs:  92.1         Run Rate:  3.67     Extras:  10 

                             Day 2  10:11 AM


                                     -X-

I am using the following regex to extract the different fields:

((?:\/)?(?:[A-Za-z']+)?\s?(?:[A-Za-z']+)?\s?(?:[A-Za-z']+)?\s?)\s+(?:lbw)?(?:not\sout)?(?:run\sout)?\s?(?:\(((?:[A-Za-z']+)?\s?(?:['A-Za-z]+)?)\))?(?:(?:st\s)?\s?(?:((?:['A-Za-z]+)\s(?:['A-Za-z]+)?)))?(?:c(?:\.)?\s((?:(?:['A-Za-z]+)?\s(?:[A-Za-z']+)?)?(?:&)?))?\s+(?:b\.)?\s+((?:[A-Za-z']+)\s(?:[A-Za-z']+)?)?\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)

Batsman Name - Group 1
Person Effecting the Stumping (if any) - Group 2
Person Effecting the Run Out (if any) - Group 3
Person Taking Catch (if any) - Group 4
Person Taking the wicket (if any) - Group 5
Runs Scored - Group 6
Balls Faced - Group 7
Fours Hit - Group 8
Sixes Hit - Group 9

Here is an example of what I need to extract:

Group 0 contains David Warner lbw b. Hassan 19 16 4 0 118.8%

Group 1 contains 'David Warner'
Group 2 does not exist in this example
Group 3 does not exist in this example
Group 4 does not exist in this example
Group 5 contains 'Hassan'
Group 6 contains '19'
Group 7 contains '16'
Group 8 contains '4'
Group 9 contains '0'

When I try this on RegExr or regex101, Group 1 is 'David Warner'. But in my Java program, Group 1 is just 'David'. The same thing happens for every result, and I don't know why.

Here's the code of my program:

Matcher bat = Pattern.compile("((?:\\/)?(?:[A-Za-z']+)?\\s?(?:[A-Za-z']+)?\\s?(?:[A-Za-z']+)?\\s?)\\s+(?:lbw)?(?:not\\sout)?(?:run\\sout)?\\s?(?:\\(((?:[A-Za-z']+)?\\s?(?:['A-Za-z]+)?)\\))?(?:(?:st\\s)?\\s?(?:((?:['A-Za-z]+)\\s(?:['A-Za-z]+)?)))?(?:c(?:\\.)?\\s((?:(?:['A-Za-z]+)?\\s(?:[A-Za-z']+)?)?(?:&)?))?\\s+(?:b\\.)?\\s+((?:[A-Za-z']+)\\s(?:[A-Za-z']+)?)?\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)").matcher(batting.group(1));        
while (bat.find()) {
    batPos++;
    Batsman a = new Batsman(bat.group(1).replace("\n", "").replace("\r", "").replace("S/R", "").replace("/R", "").trim(), batting.group(2));
    if (bat.group(0).contains("not out")) {
        a.bat(Integer.parseInt(bat.group(6)), Integer.parseInt(bat.group(7)), Integer.parseInt(bat.group(8)), Integer.parseInt(bat.group(9)), batting.group(2), false);
    } else {
        a.bat(Integer.parseInt(bat.group(6)), Integer.parseInt(bat.group(7)), Integer.parseInt(bat.group(8)), Integer.parseInt(bat.group(9)), batting.group(2), true);
    }
    if (!teams.contains(batting.group(2))) {
        teams.add(batting.group(2));
    }
    boolean f = true;
    Batsman clone = null;
    for (Batsman b1 : batted) {
        if (b1.eq(a)) {
            clone = b1;
            f = false;
            break;
        }
    }
    if (!f) {
        if (bat.group(0).contains("not out")) {
            clone.batUpdate(a.getRunScored(), a.getBallFaced(), a.getFour(), a.getSix(), false, true);

        } else {
            clone.batUpdate(a.getRunScored(), a.getBallFaced(), a.getFour(), a.getSix(), true, true);
        }
    } else {
        batted.add(a);
    }
}
  • I suggest that using a regex so complex as yours will come back to bite you on the arse. Commented Sep 5, 2017 at 5:18
  • Could you please explain exactly what data you need to extract from the text provided? Commented Sep 5, 2017 at 5:48
  • Yeah, what @ScaryWombat said: if regular expressions are getting too complex, it's going to bite you on the arse. Why don't you read the 'first' line (the line with "Player" and "Status"), and then determine the column widths? I think that is easier in the end, as it avoids the complexity of extended regular expressions like these. Commented Sep 5, 2017 at 7:01
  • @ScaryWombat I agree. But mind your wording. Commented Sep 5, 2017 at 8:24
  • The fact that you can solve a lot of problems using regexes doesn't mean that you absolutely have to use them for everything. My rule of thumb: when my regexes turn so complicated that I need to turn to other people to master them, it is time to look into other solutions, like writing your own little custom parser. Commented Sep 5, 2017 at 8:25

2 Answers

Your regex is far too complicated for such a simple task. To simplify it (or eliminate it altogether), operate on a single line at a time rather than on the whole block of text.

To do this, first split the text into lines:

String[] array = str.split("\\n");

Once you have the individual lines, split each one on runs of two or more spaces:

String[] parts = array[1].split("\\s\\s+");

You can then access each part separately. For example, the Status can be read with

System.out.println("Status - " + parts[1]);
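A minimal sketch of this splitting approach, under a few stated assumptions. The class name `SplitRowSketch` and the method `parseRow` are hypothetical, and columns are assumed to be separated by at least two spaces, as in the sample scorecard. Because a blank Status column (the Joe Burns row, for instance) produces fewer parts, `parts[1]` is not always the Status, so the sketch counts the five figures from the end of the row and treats whatever sits between the name and the figures as the dismissal text.

```java
import java.util.Arrays;

class SplitRowSketch {
    // Returns {name, dismissal, runs, balls, fours, sixes, strikeRate},
    // or null when the line does not look like a batting row.
    static String[] parseRow(String line) {
        String[] parts = line.trim().split("\\s{2,}");
        int n = parts.length;
        // A batting row needs a name plus five trailing figures.
        if (n < 6) {
            return null;
        }
        // Runs, balls, fours and sixes must be plain integers.
        for (int i = n - 5; i <= n - 2; i++) {
            if (!parts[i].matches("\\d+")) {
                return null;
            }
        }
        // Everything between the name and the figures is the dismissal text;
        // it is empty when both status columns are blank.
        String dismissal = String.join(" ", Arrays.copyOfRange(parts, 1, n - 5));
        return new String[] {
            parts[0], dismissal,
            parts[n - 5], parts[n - 4], parts[n - 3], parts[n - 2], parts[n - 1]
        };
    }

    public static void main(String[] args) {
        String text =
              "Player              Status                        Runs  Blls  4s   6s    S/R\n"
            + "David Warner        lbw            b. Hassan       19    16    4    0  118.8%\n"
            + "Joe Burns                          b. Morkel       73   149   16    0   49.0%\n"
            + "Yasir Shah          not out                         1    12    0    0    8.3%\n";
        for (String line : text.split("\\r?\\n")) {
            String[] row = parseRow(line);
            if (row != null) {
                System.out.println(String.join(" | ", row));
            }
        }
    }
}
```

Header and total lines are skipped because their trailing fields are not integers. A row whose columns are separated by single spaces would not be split correctly by this sketch.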

All the commenters are right, of course: this may not be a typical problem to solve with a regex. But to answer your question (why is there a difference between Java and regex101?), let's first pull out some of the problems that make your regex too complex. The next step would be to track down whether and why it behaves differently in Java.
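One possible way to start that tracking down, offered as a debugging aid rather than a diagnosis: print where each match begins and show line breaks and tabs in group 1 as visible escapes. With `find()`, a match can begin on the previous line, and the Java input may contain characters (for example `\r`) that the text pasted into an online tester does not. The class `MatchInspector`, its methods, and the tiny demo pattern are hypothetical; in practice you would pass your own `Pattern` and input string to `dump`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchInspector {
    // Makes line breaks and tabs visible so hidden characters in a group stand out.
    static String visible(String s) {
        if (s == null) {
            return "null";
        }
        return s.replace("\r", "\\r").replace("\n", "\\n").replace("\t", "\\t");
    }

    // Prints the position of every match and the raw content of group 1.
    static void dump(Pattern p, String input) {
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.println("match " + m.start() + ".." + m.end()
                    + " group1=[" + visible(m.group(1)) + "]");
        }
    }

    public static void main(String[] args) {
        // Hypothetical, much simplified pattern: up to two words, then a number.
        Pattern p = Pattern.compile("((?:[A-Za-z']+\\s?){0,2})\\s+(\\d+)");
        dump(p, "S/R\r\nJoe Burns   73");
    }
}
```

If group 1 turns out to start with `\r`, `\n`, or a fragment of the previous line, the optional word and whitespace slots at the front of the pattern are being spent before the name begins.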

I tried to understand your regex (and cricket at the same time!) and came up with a proposal that might help you show us what your regex should look like.

The first attempt reads until the number columns are reached. My guess is that you should be looking at alternation instead of introducing a lot of optional groups. Take a look at this: example 1

Explanation:

(                                            # group 1 start
  \/?                                        # not sure why there should be /?
  [A-Z][a-z]+                                # first name
  (?:\s(?:[A-Z]['a-z]+)+)                    # last name
)

(?:\s+                                       # spaces
(                                            # group 2 start
  lbw                                        #   lbw or
 |not\sout                                   #   not out or
 |(c\.|st|run\sout)                          #   group 3: c., st or run out
  \s                                         #   space
  \(?                                        #   optional (
  (\w+)                                      #   group 4: name
  \)?                                        #   optional )
))?                                          # group 2 end

(?:\s+                                       # spaces

(                                            # group 5 start
  (?:b\.\s)(\w+)                             # "b. " then group 6: bowler name
))?                                          # group 5 end

\s+                                          # spaces

EDIT 1: Actually, there is a 'stumped' option missing in your regex as well. Added that in mine.
EDIT 2: Stumped doesn't have a dot.
EDIT 3: The complete example can be found at example 2

Some Java code to test it:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Foo {
    public static void main(String[] args) {
        String[] examples = {
                "David Warner lbw b. Hassan 19 16 4 0 118.8%",
                "Joe Burns b. Morkel 73 149 16 0 49.0%",
                "Asad Shafiq c. Rahane b. Morkel 22 38 5 0 57.9%",
                "Yasir Shah not out 1 12 0 0 8.3%",
                "Yasir Shah st Rahane 1 12 0 0 8.3%",
                "Morne Morkel run out (Shah) 11 17 1 1 64.7%"
        };

        Pattern pattern = Pattern.compile("(\\/?[A-Z][a-z]+(?:\\s(?:[A-Z]['a-z]+)+))(?:\\s+(lbw|not\\sout|(c\\.|st|run\\sout)\\s\\(?(\\w+)\\)?))?(?:\\s+((?:b\\.\\s)(\\w+)))?\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+\\.\\d%)");
        for (String text : examples) {
            System.out.println("TEXT: " + text);
            Matcher matcher = pattern.matcher(text);
            if (matcher.matches()) {
                System.out.println("batsman: " + matcher.group(1));
                if (matcher.group(2) != null) System.out.println(matcher.group(2));
                if (matcher.group(5) != null && matcher.group(5).matches("^b.*"))
                    System.out.println("bowler: " + matcher.group(6));
                StringBuilder sb = new StringBuilder("numbers are: ");
                int[] groups = {7, 8, 9, 10, 11};
                for (int i : groups) {
                    sb.append(" " + matcher.group(i));
                }
                System.out.println(sb.toString());
                System.out.println();
            }
        }
    }
}
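As a variation on the test code above, the same pattern can be written with named groups and the `Pattern.COMMENTS` flag, which lets the commented layout from the explanation live in the source. This is only a sketch: the class name `NamedGroupsSketch` and the group names (`batsman`, `status`, `fielder`, `bowler`, and so on) are illustrative assumptions, not part of the original pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupsSketch {
    // Same structure as the pattern above. With Pattern.COMMENTS, whitespace
    // and "# ..." comments inside the pattern text are ignored by the engine.
    static final Pattern ROW = Pattern.compile(
        "(?<batsman>/?[A-Z][a-z]+(?:\\s(?:[A-Z]['a-z]+)+))        # batsman name\n"
      + "(?:\\s+(?<status>lbw|not\\sout                           # lbw, not out, or\n"
      + "  |(?:c\\.|st|run\\sout)\\s\\(?(?<fielder>\\w+)\\)?))?   # c./st/run out plus fielder\n"
      + "(?:\\s+b\\.\\s(?<bowler>\\w+))?                          # optional bowler\n"
      + "\\s+(?<runs>\\d+)\\s+(?<balls>\\d+)\\s+(?<fours>\\d+)\\s+(?<sixes>\\d+)\n"
      + "\\s+(?<strikeRate>\\d+\\.\\d%)                           # e.g. 57.9%\n",
        Pattern.COMMENTS);

    public static void main(String[] args) {
        String line = "Asad Shafiq         c. Rahane      b. Morkel       22    38    5    0   57.9%";
        Matcher m = ROW.matcher(line);
        if (m.matches()) {
            System.out.println("batsman: " + m.group("batsman"));
            // status, fielder and bowler are null when that part of the row is blank
            System.out.println("status:  " + m.group("status"));
            System.out.println("fielder: " + m.group("fielder"));
            System.out.println("bowler:  " + m.group("bowler"));
            System.out.println("runs:    " + m.group("runs"));
        }
    }
}
```

Reading groups by name avoids recounting group numbers every time an optional part is added to or removed from the pattern.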

7 Comments

Actually, this doesn't serve my purpose. The input file will not have a dot after the stumped (st), and you also forgot to give the regex for the digits. Also, just to inform you, the input has the name of the person who effected the run out in parentheses, e.g. 'run out (Rahane)'. Still, this is very helpful.
No, I didn't forget. Like I said in my answer, it reads until the number columns (they are easy). The point is: you have a lot of optional groups in your regex, where it's easier to catch the information in columns in a group. And if stumped doesn't have a dot, fine, that's info we were missing.
To prove that the number columns are easy look at regex101.com/r/nYCQUK/5. Be complete in your question, so define what the possible values are and give a complete example, so you'll get a complete answer asap. Now, the remaining question is, do you have the same problem in java reading the name?
Thank you brother for your help so far. Just look at the parenthesis point.
Don't get it. You're talking about the 'not out' being followed by name in parentheses? Look at regex101.com/r/nYCQUK/6