
I am building an application that receives HTML content as strings. I need to verify that these HTML strings are well-formed, meaning I want to parse them and detect lines with errors.

During my research, I found that Jsoup might be a useful tool for this task. However, I am encountering issues getting it to work properly.

// Attempt 1
Parser parser = Parser.htmlParser().setTrackErrors(10);
Document document = Jsoup.parse("<div><-div>");
List<ParseError> errors = parser.getErrors();

// Attempt 2
Document document = Jsoup.parse("<div>/div>", "", htmlParser());
List<ParseError> errors = document().parser().getErrors();

Despite intentionally introducing errors in larger HTML strings, the getErrors() method always returns an empty list. Am I doing something wrong, or is Jsoup not suitable for this task? Additionally, if Jsoup isn't the right tool, could someone recommend another library or method to achieve my goal?

  • JSoup parses HTML. As in, what's out there. Which is very much not in any way well formed. Indeed, JSoup is completely the wrong tool for what you want. Any XML parser will get the job done - just, filter through the tags, and fail if any of them aren't known HTML tags. Note that the concept of 'well formed HTML' is a complete failure. Nobody* cares about well formed HTML. Trying to enforce it tends to lead to problems. Commented May 26, 2024 at 0:50
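The XML-parser route suggested in this comment could look roughly like the sketch below. It uses only the JDK's built-in SAX parser. `WellFormedCheck` and `firstError` are hypothetical names made up for illustration, and the known-tag filtering step is left out. Two caveats, both assumptions about the input: it only works for XHTML-style markup with a single root element (ordinary HTML such as an unclosed `<br>` is rejected), and a SAX parser stops at the first well-formedness error, so at most one error is reported per string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Hypothetical helper (not part of any library): returns the first
    // well-formedness error, or empty if the markup parses as XML.
    static Optional<SAXParseException> firstError(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Reject DOCTYPE declarations so untrusted input cannot load external entities.
            // Side effect: a leading <!DOCTYPE html> is itself reported as an error.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // DefaultHandler rethrows fatal errors, so parsing stops at the first one.
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(markup)), new DefaultHandler());
            return Optional.empty();
        } catch (SAXParseException e) {
            // Carries the line and column where the parser gave up.
            return Optional.of(e);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parser could not be set up or run", e);
        }
    }

    public static void main(String[] args) {
        String[] samples = { "<div><p>ok</p></div>", "<div>\n<p>oops</div>", "<div><-div>" };
        for (String sample : samples) {
            Optional<SAXParseException> error = firstError(sample);
            if (error.isPresent()) {
                SAXParseException e = error.get();
                System.out.println("line " + e.getLineNumber()
                        + ", column " + e.getColumnNumber() + ": " + e.getMessage());
            } else {
                System.out.println("well-formed");
            }
        }
    }
}
```

The line number comes from where the parser detects the problem, which can be later than where the mistake was made: an unclosed `<p>` is only noticed when the mismatched closing tag is read.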

1 Answer

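You have to use the parser you defined. Your first example never passes parser to Jsoup.parse(), so the document is parsed by a separate default parser and parser.getErrors() stays empty.

Your second example does pass a parser, but it never calls setTrackErrors(10) on it. Error tracking is off by default, which is why no errors are recorded in that example either.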

Something like this should work:

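import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

// String badHtml = "<div>/div>";
String badHtml = "<div><-div>";
// Enable error tracking (off by default) and keep at most 10 errors.
Parser parser = Parser.htmlParser().setTrackErrors(10);
// Pass the configured parser in, so this parse is the one being tracked.
Document document = Jsoup.parse(badHtml, parser);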

List<ParseError> errors = document.parser().getErrors();
for (ParseError error : errors) {
    System.out.println(error);
}
// Output:
// <1:7>: Unexpected character '-' in input state [TagOpen]
// <1:12>: Unexpected EOF token [] when in state [InBody]

2 Comments

The prettyPrint setting has nothing to do with tracking errors though. Your example works the same with pretty printing on or off. Pretty printing controls how jsoup formats .html() output, after the parse.
@JohnathanHedley Ah, my mistake! Let me fix that...
