
I would like to extract text from an HTML page using regular expressions. Here is my code:

    String regExp = "<h3 class=\"field-content\"><a[^>]*>(\\w+)</a></h3>";
    Pattern regExpMatcher = Pattern.compile(regExp, Pattern.UNICODE_CHARACTER_CLASS);

    String example = "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3><h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
    Matcher m = regExpMatcher.matcher(example);
    while (m.find()) {
        System.out.println(m.group(1));
    }

I would like to get the values Проба 1 and Проба 2. However, I only get the first value, Проба 1. What is my problem?

  • Don't use regex for this. Use an HTML parser like JSoup. Commented Jun 9, 2013 at 21:07
  • It is for my school project and I have to use regular expressions... Commented Jun 9, 2013 at 21:08
  • Do not use regular expressions for parsing HTML: stackoverflow.com/questions/1732348/… Commented Jun 9, 2013 at 21:11
  • @MichaWiedenmann from the link: "Even Jon Skeet cannot parse HTML using regular expressions." This sentence made my day :). Commented Jun 9, 2013 at 21:12
  • @vikifor "I have to use regular expressions..." <-- no, you have to change teachers. Commented Jun 9, 2013 at 21:23

2 Answers


It is blasphemy to use regex + HTML. But if you really want to be cursed then here it is (you have been warned):


String regExp = "<h3 class=\"field-content\"><a[^>]*>([\\w\\s]+)</a></h3>";
                                                       ^updated part

Since Проба 1 and Проба 2 also contain spaces, you need to add \\s to the character class in your pattern. The original (\\w+) stops at the space, so it can never reach the closing </a>.
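To make the fix easy to try, here is a minimal, self-contained sketch that runs the updated pattern against the example string from the question. The class name HeadlineExtractor and the method extract are hypothetical names chosen for illustration; the pattern and the UNICODE_CHARACTER_CLASS flag are taken from the question and this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class HeadlineExtractor {

    // [\w\s]+ accepts word characters and whitespace, so titles with spaces match.
    // UNICODE_CHARACTER_CLASS makes \w cover Cyrillic letters as well as ASCII.
    static final Pattern HEADLINE = Pattern.compile(
            "<h3 class=\"field-content\"><a[^>]*>([\\w\\s]+)</a></h3>",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Collects the text of capture group 1 for every match in the input.
    static List<String> extract(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = HEADLINE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        String example = "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3>"
                + "<h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
        for (String title : extract(example)) {
            System.out.println(title);
        }
    }
}
```

Note that [\w\s]+ still rejects titles containing punctuation such as commas or hyphens, so this only suits headlines made of letters, digits, underscores and whitespace.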


4 Comments

If you talk about blasphemy, you should not play devil's advocate, now should you? :-)
It isn't blasphemy, it is sacrilege. :)
I know I am playing with fire here but there is no fun without risk }:->
@vikifor that is one of the reasons to use tools designed for such tasks like jsoup.org.

To discover the power of the dark side, you can try this pattern:

<h3 class=\"field-content\"><a[^>]*>([^<]+)</a></h3>

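Because [^<]+ matches any character other than <, including Cyrillic letters, digits and spaces, this pattern does not depend on UNICODE_CASE or any other Unicode flag.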
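As a rough sketch of how this pattern behaves, the snippet below compiles it without any flags, on the assumption that the negated class [^<] matches by exclusion and so already covers non-ASCII text. The class name AnchorTextExtractor and the method extract are hypothetical; the input in main is the example string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class AnchorTextExtractor {

    // [^<]+ captures everything up to the next '<', whatever the script or punctuation.
    static final Pattern HEADLINE = Pattern.compile(
            "<h3 class=\"field-content\"><a[^>]*>([^<]+)</a></h3>");

    // Collects the text of capture group 1 for every match in the input.
    static List<String> extract(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = HEADLINE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        String example = "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3>"
                + "<h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
        for (String title : extract(example)) {
            System.out.println(title);
        }
    }
}
```

Compared with [\w\s]+, this version also accepts titles containing punctuation, but it will not match when the link text contains nested tags such as <b>.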
