4

Since I need multithreading, which I cannot solve elegantly in PHP, I would like to program in Java. Unfortunately, I could not find a library that lets me parse an HTML DOM as robustly, quickly, and easily as PHP Simple HTML DOM Parser. Do you know of alternatives in Java that are as easy to use?

4 Answers

7

I went from Simple HTML DOM Parser to JSoup and I'm quite happy with it.
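For readers coming from Simple HTML DOM Parser, here is a minimal sketch of what the equivalent "parse, then select" step might look like in JSoup. It is not part of the original answer: the class name `JsoupSelectDemo`, the method `menuLinks`, and the sample markup are made up for illustration, and it assumes the jsoup jar is on the classpath.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class, not from the thread.
class JsoupSelectDemo {

    // Returns the href attribute of every link inside <ul class="menu">.
    static List<String> menuLinks(String html) {
        // Jsoup.parse is lenient: it builds a DOM even when closing tags are missing.
        Document doc = Jsoup.parse(html);
        // CSS selector, similar in spirit to $html->find('ul.menu a') in the PHP library.
        return doc.select("ul.menu a[href]").eachAttr("href");
    }

    public static void main(String[] args) {
        // Sample input with unclosed <li> elements.
        String html = "<ul class='menu'><li><a href='/one'>One</a><li><a href='/two'>Two</a></ul>";
        System.out.println(menuLinks(html));
    }
}
```

Since each `Jsoup.parse` call builds its own `Document`, a sketch like this should fit a setup where each thread parses its own page, though that is worth verifying for your workload.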


3 Comments

While looking deeper I just found that one, and it shows off quite a nice list of features and APIs. Finding elements by CSS selectors is quite nifty.
At first glance the functionality of JSoup even seems to exceed that of PHP Simple HTML DOM Parser. Now I will compare it to the second suggestion, TagSoup. Any pros and cons on this?
The approach with TagSoup, W3C DOM and DOM4J/JDOM should work, but it is more complex than JSoup seems to be. I'll give it a try as well, as the project looks very good to me from the description.

3

I can see that we have two challenges here:

  • Parsing HTML that might not be well-formed XHTML, which would be easy and nice to parse. I'd recommend the TagSoup library, which can read ugly HTML and produce a well-formed SAX event stream that can then be used elsewhere.

  • Building a DOM representation of the HTML document and working with it. As you probably know, the JDK includes a full-blown implementation of XML DOM (org.w3c.dom.*), but I guess this is not the type of API you've been looking for. What about DOM4J or the older JDOM, which can wrap a JDK Document and give you an easy-to-use API?
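To make the second point concrete, here is a rough sketch of the plain JDK route: parse a string into an org.w3c.dom Document and query it with XPath. The class name `XhtmlLinks`, the method `hrefs`, and the sample markup are hypothetical. Note the assumption that the input is already well-formed XHTML; the JDK parser rejects anything else, which is exactly why a cleanup step such as TagSoup is needed for real-world pages.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not from the thread.
class XhtmlLinks {

    // Returns the href of every <a> element, in document order.
    // Assumes well-formed XHTML; malformed input ends up as IllegalArgumentException.
    static List<String> hrefs(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList attrs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                result.add(attrs.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            // Wraps the checked parser/XPath exceptions to keep the sketch short.
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p><a href=\"/one\">1</a> <a href=\"/two\">2</a></p></body></html>";
        System.out.println(hrefs(page));
    }
}
```

Feeding this something as ordinary as an unclosed `<br>` makes the parser fail, which illustrates why this API alone is a poor fit for scraping.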

1 Comment

I was looking for option one: parsing HTML that is never really well-formed in real life. Accessing the XML DOM with XPath is really tricky, and I failed to write bulletproof code. TagSoup seems to be a good suggestion; now the question is what suits me better, JSoup or TagSoup.

0

I've successfully used TagSoup as a SAX parser to populate DOM4J Documents, which I then query with XPath. It took me a while to work out the incantations (Scala, but I'm sure that you can convert):

import java.io.StringReader
import org.dom4j.io.SAXReader
import org.xml.sax.InputSource
val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
val reader = new SAXReader(parserFactory.newSAXParser.getXMLReader)
val doc = reader.read(new InputSource(new StringReader(page)))
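Since the question asks for Java, a straightforward translation of the snippet above might look roughly like this. It is a sketch, not the answerer's code: the class name `TagSoupDom4j` and the method `parse` are invented for illustration, and it assumes both the TagSoup and DOM4J jars are on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.io.SAXReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical wrapper class, not from the thread.
class TagSoupDom4j {

    // Parses possibly malformed HTML into a DOM4J Document, using TagSoup as the SAX parser.
    static Document parse(String page) {
        try {
            SAXParserFactory parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl();
            SAXParser parser = parserFactory.newSAXParser();
            SAXReader reader = new SAXReader(parser.getXMLReader());
            return reader.read(new InputSource(new StringReader(page)));
        } catch (ParserConfigurationException | SAXException | DocumentException e) {
            // Wraps the checked exceptions to keep the call site simple.
            throw new IllegalStateException("Could not parse HTML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<p>Hello<br>world");
        System.out.println(doc.getRootElement().getName());
    }
}
```

One thing to check before querying the result with XPath: TagSoup may place elements in the XHTML namespace, in which case expressions such as `//a` need a namespace mapping.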


0

JSoup is a good choice. Here is an example that strips elements with certain CSS classes, as well as comments, from the HTML. It's very simple.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

...

public String cleanHtml(String htmlText) {
    Document doc = Jsoup.parse(htmlText);
    doc.select(".infobox").remove();
    doc.select(".mw-editsection").remove();
    doc.select(".hatnote").remove();
    doc.select(".catlinks").remove();
    doc.select(".noprint").remove();
    doc.select(".metadata").remove();
    doc.select(".toc").remove();
    doc.select("style").remove();
    doc.select("script").remove();
    doc.select("figure").remove();
    doc.select("*[style*=display:none]").remove();
    removeComments(doc);

    return doc.html();
}

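private static void removeComments(Node node) {
    // Iterate over a copy of the child list so that removing a node is safe
    // regardless of whether the jsoup version in use returns a live view.
    for (Node child : new java.util.ArrayList<>(node.childNodes())) {
        if ("#comment".equals(child.nodeName())) child.remove();
        else removeComments(child); // recurse into the remaining children
    }
}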
