0

My input is a large HTML file from which I have to extract some information using pattern matching. The relevant region looks roughly like this:

 some html text
 <div debugState" style="display: none;">
            Model: ModelCode[BR324]
            Features: [S08TL, S0230, S0851, S0428, S01CD, S0879, S01CA, S08SP, S0698, S01CB, S0548, S08SC, S08TM, S01CC, S0801, S0258, P0668, S04AK]
            Packages: [S0801]
 </div>
        some html text

I wrote the following code, where debInfo is the HTML source to be scanned:

Pattern model = Pattern.compile(".*(Model: ModelCode\\[\\w\\]).*, Pattern.DOTALL");
Pattern features = Pattern.compile(".*(Features: \\[\\w*\\]).*, Pattern.DOTALL");
Pattern packages = Pattern.compile(".*(Packages: \\[\\w*\\]).*, Pattern.DOTALL");


Matcher m1 = model.matcher(debInfo);
Matcher m2 = features.matcher(debInfo);
Matcher m3 = packages.matcher(debInfo);

boolean a = m1.matches();
boolean b = m2.matches();
boolean c = m3.matches();

System.out.println("matches(); " + a + " " + b + " " + c + " " + "\n" + debInfo);

But I am getting no match :-(. What am I doing wrong? Thanks a lot in advance!

1
  • Be aware that unless your HTML is in strict 7-bit ASCII, Java’s character class escapes will not work. That’s because they fail to meet requirement RL1.2a from UTS#18 Unicode Regular Expressions. They also fail to meet most of the other requirements for Basic Unicode Support. Commented Mar 21, 2011 at 18:55
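To illustrate the comment above, here is a minimal sketch of how \w behaves on non-ASCII input. The class name UnicodeWordDemo and the helper isWord are made up for this example. It also uses the Pattern.UNICODE_CHARACTER_CLASS flag, which exists since Java 7 and so was not yet available when the comment was written.

```java
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Returns true if the whole input consists of \w characters under the given flags.
    static boolean isWord(String input, int flags) {
        return Pattern.compile("\\w+", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // "Gr\u00f6\u00dfe" is the German word "Groesse" written with o-umlaut and sharp s.
        String nonAscii = "Gr\u00f6\u00dfe";

        // By default, \w only covers [a-zA-Z_0-9].
        System.out.println(isWord("BR324", 0));
        System.out.println(isWord(nonAscii, 0));

        // With UNICODE_CHARACTER_CLASS, \w also accepts non-ASCII letters.
        System.out.println(isWord(nonAscii, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

For the codes in the question (BR324, S08TL and so on) the default ASCII-only behaviour is enough; the flag only matters if the extracted values can contain non-ASCII letters.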

4 Answers

3

You use \\w inside your (correctly escaped) square brackets. That matches only a single character. Try \\w+ or \\w* instead.

Also, you have included , Pattern.DOTALL in your String literal, which I think is a typo:

Pattern model = Pattern.compile(".*(Model: ModelCode\\[\\w+\\]).*", Pattern.DOTALL);

Also note that \\w* will not work for the comma-and-space separated list of Features; you'll need something like [\\w\\s,]*.
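The points above can be checked with a small sketch. The class QuantifierDemo, the helper matchesAll and the shortened sample text are assumptions made for this example (the div tag is simplified and the feature list is cut down to three entries).

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Shortened stand-in for the HTML from the question (assumed sample data).
    static final String DEB_INFO =
            "some html text\n"
          + "<div style=\"display: none;\">\n"
          + "    Model: ModelCode[BR324]\n"
          + "    Features: [S08TL, S0230, S0851]\n"
          + "    Packages: [S0801]\n"
          + "</div>\n"
          + "some html text\n";

    // True if the regex, compiled with DOTALL, matches the entire input.
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        // A bare \w stands for exactly one character, so "BR324" does not fit.
        System.out.println(matchesAll(".*(Model: ModelCode\\[\\w\\]).*", DEB_INFO));
        // \w+ accepts one or more word characters.
        System.out.println(matchesAll(".*(Model: ModelCode\\[\\w+\\]).*", DEB_INFO));
        // \w* stops at the first comma, so the list never reaches the closing bracket.
        System.out.println(matchesAll(".*(Features: \\[\\w*\\]).*", DEB_INFO));
        // Allowing whitespace and commas covers the whole list.
        System.out.println(matchesAll(".*(Features: \\[[\\w\\s,]*\\]).*", DEB_INFO));
    }
}
```

With this sample the four checks should come out as no match, match, no match, match, in that order.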


1 Comment

True, the "," was missing. That was a copy/paste mistake, but it was not the actual problem. What was missing was the MULTILINE switch.
2

I think you need to use:

Pattern model = Pattern.compile(".*(Model: ModelCode\\[\\w*\\]).*", Pattern.DOTALL);
Pattern features = Pattern.compile(".*(Features: \\[\\w*\\]).*", Pattern.DOTALL);
Pattern packages = Pattern.compile(".*(Packages: \\[\\w*\\]).*", Pattern.DOTALL);

2 Comments

The * is missing in the first pattern.
@Luixv: then your sample HTML is wrong, as it's got multiple characters at that point.
1

These are the correct patterns:

Pattern modelPattern = Pattern.compile(".*Model: ModelCode\\[(\\w*)\\].*",
        Pattern.DOTALL | Pattern.MULTILINE);
Pattern featuresPattern = Pattern.compile(".*Features: \\[([\\w\\s,]*)\\].*",
        Pattern.DOTALL | Pattern.MULTILINE);
Pattern packagesPattern = Pattern.compile(".*Packages: \\[([\\w\\s,]*)\\].*",
        Pattern.DOTALL | Pattern.MULTILINE);
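Since these patterns put the capture group around the value only, the value can be read with group(1). The following is a rough sketch of that step, not part of the answer itself: the class DebugInfoExtractor and its methods are hypothetical names, and it swaps matches() for find(), which makes the leading and trailing .* and the flags unnecessary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DebugInfoExtractor {
    // Same capture groups as above, without the surrounding .* because find() scans the input.
    private static final Pattern MODEL = Pattern.compile("Model: ModelCode\\[(\\w*)\\]");
    private static final Pattern FEATURES = Pattern.compile("Features: \\[([\\w\\s,]*)\\]");

    // Returns the model code, or null if the input has no Model entry.
    static String modelCode(String html) {
        Matcher m = MODEL.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Returns the feature codes as a list; empty if there is no Features entry.
    static List<String> features(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = FEATURES.matcher(html);
        if (m.find()) {
            for (String item : m.group(1).split(",")) {
                String trimmed = item.trim();
                if (!trimmed.isEmpty()) {
                    result.add(trimmed);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String debInfo = "some html text\n"
                + "    Model: ModelCode[BR324]\n"
                + "    Features: [S08TL, S0230, S0851]\n"
                + "some html text\n";
        System.out.println(modelCode(debInfo));
        System.out.println(features(debInfo));
    }
}
```

The Packages entry can be handled the same way as Features, since it uses the same bracketed list format.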

0

It was missing the MULTILINE switch.

 Pattern modelPattern = Pattern.compile(".*(Model: ModelCode\\[\\w*\\]).*", Pattern.DOTALL | Pattern.MULTILINE);

1 Comment

No, the MULTILINE flag changes the meaning of the anchors, ^ and $. You aren't using anchors in that regex, so MULTILINE has no effect.
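A short sketch of what this comment describes, using made-up names (MultilineDemo, found) and a three-line sample string: without anchors the flag makes no difference, and it only changes the result once ^ is used.

```java
import java.util.regex.Pattern;

class MultilineDemo {
    // True if the regex, compiled with the given flags, is found anywhere in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "first line\nModel: ModelCode[BR324]\nlast line";

        // No anchors in the regex: MULTILINE has no effect on the result.
        System.out.println(found("Model: ModelCode\\[\\w*\\]", 0, text));
        System.out.println(found("Model: ModelCode\\[\\w*\\]", Pattern.MULTILINE, text));

        // With ^, the default mode only allows a match at the very start of the input.
        System.out.println(found("^Model:", 0, text));
        // In MULTILINE mode, ^ also matches right after each line terminator.
        System.out.println(found("^Model:", Pattern.MULTILINE, text));
    }
}
```

The first two checks give the same answer, while the last two differ, which is the only situation where the flag matters.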
