Java equivalent to PHP Simple HTML DOM Parser

Question

I need multithreading, which I cannot solve elegantly in PHP, so I would like to program in Java. Unfortunately, I could not find a library that lets me parse an HTML DOM as robustly, quickly, and easily as PHP Simple HTML DOM Parser does. Do you know of alternatives in Java that are as easy to use?

RubenGM · Accepted Answer · 2011-05-30 13:38:38Z

7

I went from Simple HTML DOM Parser to JSoup and I'm quite happy with it.


3 Comments

Tomasz Blachowicz Over a year ago

While looking deeper I just found that one, and it shows off quite a nice list of features and APIs. Finding elements by CSS selectors is quite nifty.

Dominik Over a year ago

At first glance, the functionality of JSoup even seems to exceed that of PHP Simple HTML DOM Parser. Now I will compare it to the second suggestion, TagSoup. Any pros and cons on this?

Tomasz Blachowicz Over a year ago

The approach with TagSoup, W3C DOM and DOM4J/JDOM should work, but it is more complex than JSoup seems to be. I'll give it a try as well, as the project looks very good to me from the description.

Tomasz Blachowicz · Accepted Answer · 2011-05-30 13:36:38Z

3

I can see that we have two challenges here:

1. Parsing HTML that might not be well-formed XHTML, which would be easy and nice to parse. I'd recommend the TagSoup library, which can read ugly HTML and produce a well-formed stream of SAX events that can then be used elsewhere.
2. Building a DOM representation of the HTML document and working with it. As you probably know, the JDK has a full-blown implementation of XML DOM (org.w3c.dom.*), but I guess this is not the type of API you've been looking for. What about DOM4J or the older JDOM, which can wrap a JDK Document and give you an easy-to-use API?
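To illustrate the JDK-only route from the second point, here is a minimal sketch that parses a string with the built-in DOM parser and queries it with XPath. It assumes the input is already well-formed XHTML with no namespace or DOCTYPE declaration; the class and method names (`XhtmlLinks`, `extractLinks`) are made up for this example. Real-world HTML will usually fail at the parse step, which is the gap TagSoup is meant to fill.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
class XhtmlLinks {

    /** Returns the href value of every <a> element, in document order. */
    static List<String> extractLinks(String xhtml) {
        try {
            // Assumption: input is well-formed and declares no XML namespace.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                result.add(hrefs.item(i).getNodeValue());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException
                 | XPathExpressionException e) {
            // Malformed markup (for example an unclosed <br>) ends up here.
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p><a href=\"/a\">A</a> <a href=\"/b\">B</a></p></body></html>";
        System.out.println(extractLinks(page)); // prints [/a, /b]
    }
}
```

The verbosity compared with a one-line CSS selector is the main reason a wrapper such as DOM4J or JDOM is suggested above.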


1 Comment

Dominik Over a year ago

I was looking for option one: parsing HTML, which is never really well-formed in real life. Accessing the XML DOM with XPath is really tricky, and I failed to write bulletproof code. TagSoup seems to be a good suggestion; now the question is which suits me better, JSoup or TagSoup.

Duncan McGregor · Accepted Answer · 2011-05-30 13:48:41Z

0

I've successfully used TagSoup as a SAX parser to populate DOM4J Documents, which I then query with XPath. It took me a while to work out the incantations (Scala, but I'm sure you can convert it; page is the HTML source as a String):

import java.io.StringReader
import org.dom4j.io.SAXReader
import org.xml.sax.InputSource
val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
val reader = new SAXReader(parserFactory.newSAXParser.getXMLReader)
val doc = reader.read(new InputSource(new StringReader(page)))


Francesco Sblendorio · Accepted Answer · 2023-11-30 09:12:28Z

JSoup is a good choice. Here is a very simple example that removes comments and elements matching certain CSS selectors from the HTML:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

...

public String cleanHtml(String htmlText) {
    Document doc = Jsoup.parse(htmlText);
    doc.select(".infobox").remove();
    doc.select(".mw-editsection").remove();
    doc.select(".hatnote").remove();
    doc.select(".catlinks").remove();
    doc.select(".noprint").remove();
    doc.select(".metadata").remove();
    doc.select(".toc").remove();
    doc.select("style").remove();
    doc.select("script").remove();
    doc.select("figure").remove();
    doc.select("*[style*=display:none]").remove();
    removeComments(doc);

    return doc.html();
}

private static void removeComments(Node node) {
    // Walk a snapshot of the children so removal cannot disturb the iteration.
    for (Node child : node.childNodes().toArray(new Node[0])) {
        if ("#comment".equals(child.nodeName())) child.remove();
        else removeComments(child);
    }
}
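Since the question was motivated by multithreading, here is a rough sketch of how a per-page function such as cleanHtml could be run over many pages on a thread pool using only the JDK. The names `ParallelPages` and `cleanAll` are hypothetical, and the sketch assumes the function passed in keeps no shared mutable state between calls; `String::trim` is only a stand-in so the example runs without any parsing library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper class, not part of any library.
class ParallelPages {

    /** Applies cleaner to every page on a fixed-size pool; results keep the input order. */
    static List<String> cleanAll(List<String> pages, Function<String, String> cleaner, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per page.
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (String page : pages) {
                futures.add(CompletableFuture.supplyAsync(() -> cleaner.apply(page), pool));
            }
            // Collect in submission order; join() waits for each task to finish.
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                results.add(future.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in cleaner; a real one would parse and clean the HTML.
        Function<String, String> cleaner = String::trim;
        System.out.println(cleanAll(List.of(" <p>a</p> ", " <p>b</p> "), cleaner, 2));
        // prints [<p>a</p>, <p>b</p>]
    }
}
```

If a task throws, `join()` rethrows it wrapped in an unchecked `CompletionException`, so a failure on one page surfaces in the caller rather than being silently lost.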
