4

I need to find a PHP-based HTML validator (W3C-like) that can detect invalid HTML or XHTML. I've searched Google a little, but was curious whether anyone has used one they particularly liked.

I have the HTML in a string:

$html = "<html><head>.....</body></html>";

I would like to test the page and have the validator return the errors, not echo or print anything.

I've seen:
- http://www.bermi.org/xhtml_validator
- http://twineproject.sourceforge.net/doc/phphtml.html

The background for this is that I'd like to have a function/class that I run on every page. It would check whether the file has been modified since the last access date (or something similar to that) and, if it has, run the validator so I am immediately notified of invalid HTML while coding.
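A rough sketch of the check I have in mind, assuming the validator is passed in as a callable that takes the HTML string and returns an array of errors. The names validate_if_modified, $stampFile and $validator are made up for illustration and do not come from any library:

```php
<?php
// Hypothetical helper: runs $validator on $file only when the file's
// modification time is newer than the timestamp stored in $stampFile.
// Returns the validator's result, or null when nothing was checked.
function validate_if_modified(string $file, string $stampFile, callable $validator): ?array
{
    if (!is_file($file)) {
        return null; // nothing to validate
    }
    $mtime = filemtime($file);
    if ($mtime === false) {
        return null; // modification time could not be read
    }

    // Timestamp recorded at the previous check (0 if there was none).
    $last = is_file($stampFile) ? (int) file_get_contents($stampFile) : 0;
    if ($mtime <= $last) {
        return null; // unchanged since the last check
    }

    // Remember this modification time, then validate the current contents.
    file_put_contents($stampFile, (string) $mtime);
    return $validator((string) file_get_contents($file));
}
```

The errors come back as a return value, so nothing is echoed unless the caller decides to.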

4 Answers

6

There's no need to reinvent the wheel on this one. There's already a PEAR library that interfaces with the W3C HTML Validator API. They're willing to do the work for you, so why not let them? :)

2 Comments

Pretty cool, but you have to rely on their web service, which means you must be connected to the public internet. Very neat, though.
This is definitely an option.
2

While it isn't strictly PHP (it is an executable), one I really like is W3C's HTML Tidy. It shows what is wrong with the HTML and fixes it if you want it to. It also beautifies the HTML so it doesn't look like a mess. It runs from the command line and is easy to integrate into PHP.

Check it out: http://www.w3.org/People/Raggett/tidy/
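If your PHP build has the tidy extension enabled, you can reach the same engine without shelling out. This is a minimal sketch, not a definitive implementation. The helper name html_diagnostics is made up for illustration, and it returns null when the extension is missing:

```php
<?php
// Hypothetical helper: returns tidy's diagnostic lines for $html as an array,
// or null when the tidy extension is not loaded. Nothing is echoed.
function html_diagnostics(string $html): ?array
{
    if (!class_exists('tidy')) {
        return null; // tidy extension not available in this PHP build
    }

    $tidy = new tidy();
    $tidy->parseString($html, array(), 'utf8');
    $tidy->diagnose(); // adds tidy's diagnostic summary to the error buffer

    // errorBuffer holds one message per line (warnings, errors, info lines).
    $buffer = trim((string) $tidy->errorBuffer);
    return $buffer === '' ? array() : explode("\n", $buffer);
}
```

The returned lines can then be logged or shown in a debug panel, whichever fits your workflow.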

0

If you can't use Tidy (some hosting services do not enable this PHP module), you can use this PHP class: http://www.barattalo.it/html-fixer/

-2

I had a case where I needed to check partial HTML for unmatched and malformed tags (mostly things like </br>, a common error in my samples), and the various heavy-duty validators were overkill. So I ended up writing my own validation routine in PHP, pasted below. You may need to use mb_substr instead of index-based character retrieval if your text is in different languages. Note that it does not parse CDATA or script/style tags, but it can be extended easily:

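// Returns false on unmatched or malformed tags (including a closing tag
// for a void element, such as </br>) and true otherwise.
function check_html( $html )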
{
    $stack = array();
    $autoclosed = array('br', 'hr', 'input', 'embed', 'img', 'meta', 'link', 'param', 'source', 'track', 'area', 'base', 'col', 'wbr');
    $l = strlen($html); $i = 0;
    $incomment = false; $intag = false;
    $closetag = false; $tag = '';
    while($i<$l)
    {
        while($i<$l && preg_match('#\\s#', $c=$html[$i])) $i++;
        if ( $i >= $l ) break;
        if ( $incomment && ('-->' === substr($html, $i, 3)) )
        {
                // close comment
                $incomment = false;
                $i += 3;
                continue;
        }
        $c = $html[$i++];
        if ( '<' === $c )
        {
            if ( $incomment ) continue;
            if ( $intag )  return false;
            if ( '!--' === substr($html, $i, 3) )
            {
                // open comment
                $incomment = true;
                $i += 3;
                continue;
            }

            // open tag
            $intag = true;
            if ( $i < $l && '/' === $html[$i] )
            {
                $i++;
                $closetag = true;
            }
            else
            {
                $closetag = false;
            }
            $tag = '';
            while($i<$l && preg_match('#[a-z0-9\\-]#i', $c=$html[$i]) )
            {
                $tag .= $c;
                $i++;
            }
            if ( !strlen($tag) ) return false;
            $tag = strtolower($tag);
            if ( $i<$l && !preg_match('#[\\s/>]#', $html[$i]) ) return false;
            if ( $i<$l && $closetag && preg_match('#^\\s*/>#sim', substr($html, $i)) ) return false;
            if ( $closetag )
            {
                if ( in_array($tag, $autoclosed) || (array_pop($stack) !== $tag) )
                    return false;
            }
            else if ( !in_array($tag, $autoclosed) )
            {
                $stack[] = $tag;
            }
        }
        else if ( '>' ===$c )
        {
            if ( $incomment ) continue;
            
            // close tag
            if ( !$intag ) return false;
            $intag = false;
        }
    }
    return !$incomment && !$intag && empty($stack);
}

2 Comments

It is a very bad idea to write your own HTML parser, especially if your code will be used on untrusted inputs. HTML is very complex. The parsing rules for HTML5 are very complicated and handle many nuanced edge cases. For some common misconceptions about HTML that may trip up "roll your own" parsers, see: alanhogan.com/html-myths#close-tags
There are cases (like mine) where a very simple custom parser is all that is needed, and I could not find such a simple one elsewhere. So this is offered for such cases; otherwise I totally agree with you.
