I'm trying to parse some HTML that is not on my server:

    $dom = new DOMDocument();
    $dom->loadHTMLFile("http://www.some-site.org/page.aspx");
    // getElementById() returns a single DOMElement (or null), not a node list
    echo $dom->getElementById('his_id')->textContent;

But PHP returns an error along the lines of "ID his_id already defined in http://www.some-site.org/page.aspx, line: 33". I think that is because DOMDocument is dealing with invalid HTML. So how can I parse it even though it is invalid?

3 Answers


You should run HTML Tidy on it to clean it up before parsing it.

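    // Fetch the remote page as a string.
    $html = file_get_contents('http://www.some-site.org/page.aspx');

    // 'clean' replaces presentational markup with style rules;
    // 'output-html' asks Tidy to emit HTML.
    $config = array(
        'clean' => 'yes',
        'output-html' => 'yes',
    );
    $tidy = tidy_parse_string($html, $config, 'utf8');
    $tidy->cleanRepair();

    // loadHTML() expects a string, so cast the tidy object explicitly.
    $dom = new DOMDocument;
    $dom->loadHTML((string) $tidy);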

See this list of options.


4 Comments

Tidy is not available for me :(
@kmunky Why not? Without Tidy you're SOL, basically.
I solved the problem: I installed php_tidy, but now I get the following error: "ID top already defined in Entity, line: 52"
Duplicate IDs; you'll have to fix them yourself (been there, done that).

Have a look at: libxml_use_internal_errors()

http://php.net/libxml_use_internal_errors
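
To make that pointer concrete, here is a minimal sketch of how libxml_use_internal_errors() might be used here. The helper name loadLenientHtml() and the sample markup are assumptions made up for illustration; they are not from the question. The idea is to have libxml collect parse problems internally instead of emitting PHP warnings, and then to look elements up with XPath, which does not rely on IDs being unique.

```php
<?php
// Hypothetical helper (not part of any library): parse HTML while
// collecting libxml parse problems instead of emitting PHP warnings.
function loadLenientHtml(string $html): array
{
    // Remember the previous setting so it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Grab whatever the parser complained about, then reset the buffer.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$dom, $errors];
}

// Assumed sample input with a duplicated id, similar to the reported case.
$html = '<html><body><div id="top">first</div><div id="top">second</div></body></html>';
[$dom, $errors] = loadLenientHtml($html);

// Each entry is a LibXMLError object with message and line properties.
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// XPath matches on the attribute value, so duplicated ids are all returned.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//*[@id="top"]') as $node) {
    echo $node->textContent, "\n";
}
```

For a remote page, the same pattern should apply with loadHTMLFile() in place of loadHTML(). Whether the parser reports duplicate IDs at all may depend on the libxml2 version PHP was built against.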

1 Comment

If you are merely going to recommend a link, please do so as a comment under the question instead of posting as an answer.

Reading the docs, I see a $dom->strictErrorChecking that defaults to TRUE. What happens if you set $dom->strictErrorChecking = false?
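
A quick way to try this out is sketched below, with an assumed sample document containing a duplicated id. As far as I can tell from the docs, strictErrorChecking governs whether DOM operations throw DOMException, so it may not silence parser messages on its own. The sketch therefore also counts what libxml still reports rather than assuming a result.

```php
<?php
// Assumed sample markup with a duplicated id (not from the question).
$html = '<html><body><div id="top">a</div><div id="top">b</div></body></html>';

$dom = new DOMDocument();
$dom->strictErrorChecking = false; // the setting suggested above

// Collect parser messages instead of printing warnings, so we can
// inspect whether the setting changed what the parser reports.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

printf(
    "Parser reported %d issue(s) with strictErrorChecking disabled.\n",
    count($errors)
);
```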
