I am trying to parse a page with HtmlUnit and the html has a defect in it, where table columns are ended with <?td> instead of </td>. Unfortunately I can't fix the html myself on the server-side as I don't own the project, so I need to work around this.

I noticed that when I save the page onto my hard drive from Chrome (right-click -> save as) and then I open the file that I've saved and view the source (right-click -> view page source), Chrome has magically fixed the error in the actual html. After the page has been saved and re-opened by Chrome I see this in the source <td> <!--?td--> </td>, so it seems like Chrome has detected the error, commented it out and replaced it with the correct tag.

Is it possible to get HtmlUnit to do something similar? Either automatically, or can I implement some kind of filter myself to replace all <?td> with </td> before it parses it into an HtmlPage? I see that I can implement my own IncorrectnessListener for the WebClient, perhaps something in there? I haven't been able to figure it out so any help would be appreciated.

1 Answer

HTML parsers use heuristics to deal with invalid HTML content; in many situations they insert missing end tags. In your case the browser has simply detected an unsupported tag and added the missing td end tag at the (more or less) correct position, because the next td start tag requires the previous td to be closed first.

HtmlUnit (using NekoHtml) tries to implement the same heuristics as browsers do. So you can simply load the page and then save it as XHTML with asXml(); you should see the inserted td end tags there as well. HtmlUnit will not preserve the wrong tags as comments, though (I guess).

If you think there is something wrong with the heuristics implemented by HtmlUnit (or they differ from the ones used by the browser), you can open an issue (please provide a minimal, detailed sample) and I will try to fix it.

If you really have to patch the incoming HTML, please have a look at the FAQ page (How to modify the outgoing request or incoming response?).
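If patching really is unavoidable, the text substitution itself can be sketched in plain Java, independent of HtmlUnit. This is only a minimal sketch: the class `TdEndTagFixer` and its `fix` method are hypothetical names, not part of HtmlUnit, and the pattern assumes the only defect is `<?td>` standing in for `</td>`.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of HtmlUnit) that repairs the malformed end tag. */
public final class TdEndTagFixer {

    // Matches "<?td>", ignoring case and tolerating stray whitespace around "td".
    private static final Pattern BROKEN_TD_END =
            Pattern.compile("<\\?\\s*td\\s*>", Pattern.CASE_INSENSITIVE);

    private TdEndTagFixer() {
        // static utility, no instances
    }

    /** Returns the markup with every "<?td>" replaced by "</td>". */
    public static String fix(String html) {
        return BROKEN_TD_END.matcher(html).replaceAll("</td>");
    }

    public static void main(String[] args) {
        String broken = "<table><tr><td>a<?td><td>b<?td></tr></table>";
        // prints <table><tr><td>a</td><td>b</td></tr></table>
        System.out.println(fix(broken));
    }
}
```

A method like this could be called on the response body inside the kind of response-modifying hook the FAQ entry describes, before the content reaches the parser.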

9 Comments

Ok I'll play around with it and see if I can come up with a specific example of what's breaking for me and send it over. In the meantime, am I right in thinking that for the subclassed WebConnection, it would be set on the WebClient like webClient.setWebConnection(new WebConnectionWrapper(webClient));?
Here is an example of a page that isn't working for me: racingzone.com.au/results/2010-01-01. If you look at the div with id="container", I would expect the 4th child to be <table>. But when opening this URL with HtmlUnit, the 4th (up to 15th) children are HtmlTableColumn, not in a table and not in rows. I think it might have something to do with the column width definitions not being in a <colgroup>? Chrome automatically wraps them in a colgroup when saving the file locally.
Will have a look, but please open an HtmlUnit issue.
Ok, this is fixed now. It will take some time to get a new build and make a new snapshot available. Please have a look at twitter.com/htmlunit; I will post there when a new snapshot build is available.
There is a new HtmlUnit release (2.31) available that includes the fix.
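
On the question in the comments about wiring up the wrapper: as far as I know, the `WebConnectionWrapper(WebClient)` constructor installs itself as the client's connection, so an extra `setWebConnection` call should not be needed. The following is a rough sketch along the lines of the FAQ entry, assuming HtmlUnit 2.x (`com.gargoylesoftware.htmlunit` packages) on the classpath and a version where `getContentCharset()` returns a `Charset`. The class name and the URL are placeholders, and the plain `replace` call assumes `<?td>` is the only defect.

```java
import java.io.IOException;
import java.nio.charset.Charset;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.WebResponseData;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

public class PatchedResponseExample {

    public static void main(String[] args) throws IOException {
        try (WebClient webClient = new WebClient()) {
            // The wrapper's constructor registers it on the WebClient,
            // so the anonymous instance does not need to be stored.
            new WebConnectionWrapper(webClient) {
                @Override
                public WebResponse getResponse(WebRequest request) throws IOException {
                    WebResponse response = super.getResponse(request);

                    // Leave scripts, stylesheets, images etc. untouched.
                    if (!"text/html".equals(response.getContentType())) {
                        return response;
                    }

                    // Decode and re-encode with the same charset.
                    Charset charset = response.getContentCharset();
                    String html = response.getContentAsString(charset);
                    String patched = html.replace("<?td>", "</td>");

                    WebResponseData data = new WebResponseData(
                            patched.getBytes(charset),
                            response.getStatusCode(),
                            response.getStatusMessage(),
                            response.getResponseHeaders());
                    return new WebResponse(data, request, response.getLoadTime());
                }
            };

            // Placeholder URL; substitute the page that contains the broken markup.
            HtmlPage page = webClient.getPage("https://example.com/");
            System.out.println(page.asXml());
        }
    }
}
```

Printing `asXml()` at the end is just a convenient way to inspect the DOM that HtmlUnit actually built from the patched response.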