1

I've got an array of strings similar to

 <div id="option1">hello</div>
 <div style="color: cyan">world</div>

Is there a way to extract the text from inside the divs? I've already written something, but it's not dynamic (I have to specify the length of the text), which is useless for my application because the content of the array above is not always the same.

Hope you can understand my question, I will reply asap if you need any more information.

I am using Java.

7
  • 3
    You need an HTML parser. Commented Dec 19, 2011 at 18:27
  • 1
    Chuck Norris uses regex here:) Commented Dec 19, 2011 at 18:29
  • Have you tried some of the XML parsers available in Java? SAX? Xerces? Commented Dec 19, 2011 at 18:30
  • 1
    @PetarMinchev, No, Chuck Norris doesn't use regexes. Data sees him coming and parses itself. Commented Dec 19, 2011 at 18:30
  • Don't dare use regex, even though it sometimes happily works with HTML. Commented Dec 19, 2011 at 18:32

3 Answers

3

A complete Jsoup example:

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;

List<String> res = new ArrayList<String>();
String[] html = new String[] { 
    "<div id=\"option1\">hello</div>",
    "<div style=\"color: cyan\">world</div>" };
for (String el : html) {
    String text = Jsoup.parse(el).text();
    res.add(text);
    System.out.println(text);
}

Output:

hello
world

Note that the HTML from your example is well-formed XML and could be parsed using any XML parser, as well. You'll need an HTML-specific parser when dealing with input that is not well-formed.
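For the well-formed case described above, the parser that ships with the JDK is enough. The following is a minimal sketch, not a general HTML solution: the class name `XmlDivText` and the helper `textOf` are made up for illustration, and it assumes every string is a single well-formed XML element. Input such as `<br>` or unquoted attributes makes the parser throw.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlDivText {
    // Hypothetical helper: returns the text inside the root element of a
    // well-formed XML fragment. Parse failures are rethrown unchecked.
    public static String textOf(String fragment) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(fragment)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + fragment, e);
        }
    }

    public static void main(String[] args) {
        String[] html = {
            "<div id=\"option1\">hello</div>",
            "<div style=\"color: cyan\">world</div>" };
        for (String el : html) {
            System.out.println(textOf(el));
        }
    }
}
```

This avoids an external library, at the cost of failing on anything that is not strict XML.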


1

As @SLaks said, use an HTML parser. There are lots of good ones for Java. My favourite is jsoup.

2 Comments

Thanks, is there any way of doing this other than using an external library?
Not really. Java has a good XML parser "built in", but HTML is a very different beast.
0

If you know that there will only be one set of HTML tags (even better if you know which tag it is), you might be able to do something like:

String[] html = new String[] { 
    "<div id=\"option1\">hello</div>",
    "<div style=\"color: cyan\">world</div>" };

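for (String element : html) {
    // Position of the '>' that ends the opening tag
    int openTagEnd = element.indexOf('>');
    // Position of the '<' that starts the closing tag; could become lastIndexOf("</div>")
    int closeTagStart = element.lastIndexOf('<');

    String contents = element.substring(openTagEnd + 1, closeTagStart);
    System.out.println(contents);
}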

Please note that I haven't tested this code or written it in an IDE, so it may not be entirely correct, but I think you can see where I am coming from: take the string between the ">" that closes the opening tag and the "<" that starts the closing tag.

I can also see that something like this could be modified to handle strings with multiple HTML tags, with a bit of imagination...
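One way the multiple-tag case might look is a single pass that keeps only the characters outside angle brackets. This is a rough sketch under simplifying assumptions: `TagStripper` and `stripTags` are illustrative names, a `>` inside an attribute value or a comment will confuse it, and entities such as `&amp;` are left undecoded.

```java
public class TagStripper {
    // Copies every character that is not part of a <...> tag.
    public static String stripTags(String s) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inTag) {
                // Inside a tag: skip everything up to and including '>'
                if (c == '>') {
                    inTag = false;
                }
            } else if (c == '<') {
                inTag = true;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<div><b>hello</b> <i>world</i></div>"));
    }
}
```

For anything beyond simple, trusted input, a real HTML parser remains the safer choice.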

Alternatively, and I can't believe I didn't think of this to start with, you could use something like the following. Again, it is limited to one HTML tag, though I'm sure you could come up with a tag-counting method if needed.

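import java.util.regex.Matcher;
import java.util.regex.Pattern;

String[] html = new String[] {
    "<div id=\"option1\">hello</div>",
    "<div style=\"color: cyan\">world</div>" };

String tag = "div";
// \\b keeps "div" from also matching e.g. <divider>; [^>]* skips the attributes
Pattern p = Pattern.compile("<" + tag + "\\b[^>]*>(.*?)</" + tag + ">");

for (String element : html) {
    Matcher m = p.matcher(element);
    // group(1) is the text captured between the opening and closing tag
    while (m.find()) {
        System.out.println(m.group(1));
    }
}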
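The tag-counting idea mentioned above could be sketched roughly as follows. It keeps a depth counter so that a nested tag of the same name does not end the match early, which is where the non-greedy pattern stops too soon. `NestedTagText` and `outerContent` are hypothetical names, and the sketch assumes balanced tags and no self-closing forms such as `<div/>`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTagText {
    // Returns everything between the first opening `tag` and its matching
    // closing tag, or null if no balanced pair is found.
    public static String outerContent(String s, String tag) {
        // group(1) is "/" for a closing tag and empty for an opening tag
        Pattern p = Pattern.compile("<(/?)" + Pattern.quote(tag) + "\\b[^>]*>");
        Matcher m = p.matcher(s);
        int depth = 0;
        int start = -1;
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                if (depth == 0) {
                    start = m.end(); // content begins after the outermost opening tag
                }
                depth++;
            } else if (depth > 0) {
                depth--;
                if (depth == 0) {
                    return s.substring(start, m.start());
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(outerContent("<div>a<div>b</div>c</div>", "div"));
    }
}
```

The result still contains the inner markup, so it would need a further pass if only the text is wanted.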

HTH
