java extracting from string

Question

I've got an array of strings similar to

 <div id="option1">hello</div>
 <div style="color: cyan">world</div>

Is there a way to extract the text inside the divs? I've already written something, but it isn't dynamic: I have to specify the length of the text to extract, which is useless in my application because the contents of the array are not always the same.

I hope the question is clear. I will reply as soon as possible if you need more information.

I am using Java.

Have you tried some of the XML parsers available in Java? SAX? Xerces?
– Roy Kachouh, Commented Dec 19, 2011 at 18:30
@PetarMinchev, no, Chuck Norris doesn't use regexes. Data sees him coming and parses itself.
– cdeszaq, Commented Dec 19, 2011 at 18:30
Don't dare use regex, even though it sometimes happens to work with HTML.
– Ajinkya, Commented Dec 19, 2011 at 18:32

Wayne · Accepted Answer · 2011-12-19 18:54:58Z

3

A complete Jsoup example:

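// Requires the jsoup library on the classpath
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;

List<String> res = new ArrayList<>();
String[] html = new String[] {
    "<div id=\"option1\">hello</div>",
    "<div style=\"color: cyan\">world</div>" };
for (String el : html) {
    // Parse the fragment and keep only its text content
    String text = Jsoup.parse(el).text();
    res.add(text);
    System.out.println(text);
}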

Output:

hello
world

Note that the HTML from your example is well-formed XML and could be parsed using any XML parser, as well. You'll need an HTML-specific parser when dealing with input that is not well-formed.
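
As a rough illustration of that note, here is a minimal sketch that uses only the XML parser bundled with the JDK (`javax.xml.parsers`). The class name `XmlTextExtractor` and the helper `extractText` are made up for this example, and it assumes every string in the array is a well-formed fragment with a single root element:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlTextExtractor {

    // Hypothetical helper: returns the text content of the root element
    // of a well-formed XML/XHTML fragment.
    static String extractText(String fragment) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(fragment)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Reached when the fragment is not well-formed XML
            throw new IllegalArgumentException("Not well-formed XML: " + fragment, e);
        }
    }

    public static void main(String[] args) {
        String[] html = {
            "<div id=\"option1\">hello</div>",
            "<div style=\"color: cyan\">world</div>" };
        for (String el : html) {
            System.out.println(extractText(el)); // prints "hello", then "world"
        }
    }
}
```

This avoids an external dependency, but it only works while the input stays well-formed.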


cdeszaq · Accepted Answer · 2011-12-19 18:29:57Z

1

As @SLaks said, use an HTML parser. There are lots of good ones for Java. My favourite is jSoup.


2 Comments

user1106495 Over a year ago

Thanks. Is there any way of doing this without using an external library?

cdeszaq Over a year ago

Not really. Java has a good XML parser "built in", but HTML is a very different beast.
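
To make the difference concrete, the sketch below feeds a few fragments to the JDK's built-in XML parser and reports whether it accepts them. The class `WellFormedCheck` and the helper `isWellFormedXml` are hypothetical names used only for this illustration. The sample inputs are ordinary HTML that browsers accept but that is not well-formed XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedCheck {

    // Hypothetical helper: true if the built-in XML parser accepts the fragment.
    static boolean isWellFormedXml(String fragment) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(fragment)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input as malformed XML
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed: accepted
        System.out.println(isWellFormedXml("<div id=\"option1\">hello</div>"));
        // Unclosed <br>: fine in HTML, rejected as XML
        System.out.println(isWellFormedXml("<div>hello<br>world</div>"));
        // Attribute without a value: fine in HTML, rejected as XML
        System.out.println(isWellFormedXml("<div hidden>hello</div>"));
    }
}
```

The JDK's default error handler may also print a diagnostic to standard error for each rejected fragment.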

Andy · Accepted Answer · 2011-12-19 20:35:28Z

If you know that each string contains only one pair of HTML tags (even better if you know which tag it is), you might be able to do something like:

String[] html = new String[] {
    "<div id=\"option1\">hello</div>",
    "<div style=\"color: cyan\">world</div>" };

for (String el : html) {
    // Index of the '>' that closes the opening tag
    int firstEnd = el.indexOf(">");
    // Index of the '<' that starts the closing tag; could search for "</div>" instead
    int lastBeginning = el.lastIndexOf("<");

    // substring's end index is exclusive, so no "- 1" is needed
    String contents = el.substring(firstEnd + 1, lastBeginning);
    System.out.println(contents);
}

Please note that I haven't tested this code, nor written it in an IDE, so it may not be entirely correct, but I think you can see where I am coming from. The idea is simply to take the substring between the ">" that closes the opening tag and the "<" that starts the closing tag.

I can also see that something like this could be modified to handle strings with multiple HTML tags, with a bit of imagination...
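
One possible way to do that, sketched here as an illustration rather than a tested solution: walk the string once and keep only the characters that fall outside any `<...>` pair. The class `TagStripper` and the method `stripTags` are made-up names, and the sketch assumes that `<` and `>` never occur inside attribute values or in the text itself.

```java
class TagStripper {

    // Hypothetical helper: copies every character that is outside a <...> tag.
    // Assumes '<' and '>' only ever appear as tag delimiters.
    static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;       // a tag starts here
            } else if (c == '>') {
                insideTag = false;      // the tag ends here
            } else if (!insideTag) {
                text.append(c);         // plain text between tags
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<div id=\"option1\">hello</div>"));
        // prints "hello"
        System.out.println(stripTags("<div><b>hello</b> <i>world</i></div>"));
        // prints "hello world"
    }
}
```

Comments, scripts, and entities such as `&amp;` are not handled, which is where a real HTML parser earns its keep.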

Alternatively, and I can't believe I didn't think of this to start with, you could use a regular expression like the following. Again, it is limited to one HTML tag, though I'm sure you could come up with a tag-counting method if needed.

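import java.util.regex.Matcher;
import java.util.regex.Pattern;

String[] html = new String[] {
    "<div id=\"option1\">hello</div>",
    "<div style=\"color: cyan\">world</div>" };

String tag = "div";
// Reluctant quantifiers: match the opening tag, capture the content, match the closing tag
Pattern p = Pattern.compile("<" + tag + ".*?>(.*?)</" + tag + ">");

for (String el : html) {
    Matcher m = p.matcher(el);
    while (m.find()) {
        System.out.println(m.group(1));
    }
}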

HTH
