HTML Validation using PHP

Question

I have a number of sites on a server that have multiple employees working on them daily. With so many people handling these files, it can be hard to keep up with certain things. I am building a PHP-based tool that will scan these pages for broken links, spelling errors, and similar problems.

Right now I am working on the HTML validation portion of this, namely missing or extra opening and closing tags. I found a post on here (PHP Based HTML Validator) that led me to a PEAR package that links in with W3C validation.

I have not tried this yet, as the last version released is almost 5 years old. Can anyone point me in the right direction?

While the last version "released" through Pear is almost 5 years old, if you take a look on the project's GitHub repository you'll see plenty of recent activity.
– Matt Gibson, Commented Jan 20, 2015 at 20:33

Community · Accepted Answer · 2023-11-17 19:32:19Z

2

For this job PHP has a built-in function, libxml_get_errors(), which returns an array of LibXMLError objects describing the problems libxml found while parsing. Take a look at its documentation, which also includes an example.

My test with page body:

<?php

libxml_use_internal_errors(true);

$xmlstr = <<< XML
    <body>
        <h1>Correct tag</h1>
        <h2>Tag not closed</h2>
        <p>Missing end of paragraph
        <br>
        <script type="text/javascript">
        var test = "Script";
        </script>
        <img src="some.url" alt="Image title" >
        <footer>Some error in footer?<footer>
    </body>
XML;

$doc = simplexml_load_string($xmlstr);
$xml = explode("\n", $xmlstr);

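// simplexml_load_string() returns false on failure; compare strictly, because
// a valid but empty element can also evaluate to false.
if ($doc === false) {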
    $errors = array_reverse ( libxml_get_errors() );
    echo "<pre>";
    foreach ($errors as $error) {
        echo display_xml_error($error, $xml);
    }
    echo "</pre>";
    libxml_clear_errors();
}


function display_xml_error($error, $xml)
{
    $return  = $xml[$error->line - 1] . "\n";
    $return .= str_repeat('-', $error->column) . "^\n";

    switch ($error->level) {
        case LIBXML_ERR_WARNING:
            $return .= "Warning $error->code: ";
            break;
        case LIBXML_ERR_ERROR:
            $return .= "Error $error->code: ";
            break;
        case LIBXML_ERR_FATAL:
            $return .= "Fatal Error $error->code: ";
            break;
    }

    $return .= trim($error->message);

    if ($error->file) {
        $return .= "\n  File: $error->file";
    }

    return "$return\n\n--------------------------------------------\n\n";
}

?>

Results with:

---------^
Fatal Error 77: Premature end of data in tag body line 1

--------------------------------------------

    
---------^
Fatal Error 77: Premature end of data in tag p line 4

--------------------------------------------

    
---------^
Fatal Error 77: Premature end of data in tag br line 5

--------------------------------------------

    
---------^
Fatal Error 77: Premature end of data in tag img line 9

--------------------------------------------

    
---------^
Fatal Error 77: Premature end of data in tag footer line 10

--------------------------------------------

    
---------^
Fatal Error 76: Opening and ending tag mismatch: footer line 10 and body

--------------------------------------------

Do not be confused by the error about body not being closed; it is a consequence of the earlier unclosed tags. If the markup is well formed, no errors are reported. For example, the following code produces no errors according to libxml_get_errors():

<body>
    <h1>Correct tag</h1>
    <h2>Tag closed</h2>
    <p>Not missing end of paragraph</p>
    <br />
    <script type="text/javascript">
    var test = "Script";
    </script>
    <img src="some.url" alt="Image title" />
    <div class="somediv">
        <p>Paragraph nested</p>
        <ul>
            <li>List element</li>
            <li>List element</li>
        </ul>
    </div>
    <footer>No error in footer</footer>
</body>
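
For a tool that scans many pages, the same check could be wrapped in a reusable function. The following is only a minimal sketch: the name find_markup_errors() is hypothetical and not part of the answer above, and it reports XML well-formedness problems rather than full HTML validity.

```php
<?php
// Hypothetical helper (not from the answer): returns one message per libxml error.
function find_markup_errors(string $markup): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    simplexml_load_string($markup);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    // Leave libxml in the state the caller had it in.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

$ok     = find_markup_errors('<body><p>Closed paragraph</p><br /></body>');
$broken = find_markup_errors('<body><p>Missing end of paragraph</body>');

echo count($ok), " error(s) in the well-formed sample\n";
echo count($broken), " error(s) in the broken sample\n";
```

Returning plain strings keeps the helper easy to plug into an existing report; the LibXMLError objects could be returned instead if the level or code is needed.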

answered Jan 20, 2015 at 20:53 by skobaljic, edited Nov 17, 2023 at 19:32 by Community Bot


4 Comments

David Webb Over a year ago

this gave me what i need to get started on investigating and tweaking this to meet my needs. thank you!

skobaljic Over a year ago

HTML should be well formed, so for broken tags it works fine. Take a look at the small test I made above.

Matt Gibson Over a year ago

Well, yes, but for example <p>This is a paragraph.<p>This is another paragraph</p> is valid, well-formed HTML5, as HTML5 doesn't follow the XML rules. Paragraphs can be considered closed when a new paragraph comes along, and that's perfectly valid according to the spec.

skobaljic Over a year ago

It is important that he gets all the warnings; he can then ignore the ones for paragraph, list, body, and break tags.
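
The disagreement in these comments can be checked directly. The sketch below feeds the snippet from Matt Gibson's comment to the same XML parser used in the answer; it only demonstrates that an XML well-formedness check rejects markup that the comment describes as valid HTML5, and the variable names are illustrative.

```php
<?php
libxml_use_internal_errors(true);

// The snippet from the comment: acceptable as HTML5, but not well-formed XML,
// because the first <p> is never explicitly closed.
$html = '<p>This is a paragraph.<p>This is another paragraph</p>';

$doc = simplexml_load_string($html);

// Gather whatever libxml complained about.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}
libxml_clear_errors();

if ($doc === false) {
    echo "Rejected by the XML parser:\n";
    echo implode("\n", $messages), "\n";
}
```

This is why the results from libxml_get_errors() are better treated as warnings to review than as a verdict on HTML validity.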

Misunderstood · Accepted Answer · 2015-01-20 21:08:03Z

-1

I have a validation app that runs multiple validations.

First I check the page for basic parameters and server configuration.

Parameters that are in range print in green, out-of-range ones in red, and borderline ones in yellow.

Web Server Configuration
Base Page HTML
Compression enabled: Yes
Cache specified: Yes max-age=31536000, public
Expiration specified: Yes
Keep Alive Specified: Yes
Characterset Specified: Yes
Web Server Performance Metrics
Base Page HTML
Base Page Size: 17,126 Bytes
Transmission Speed: 2,137,041 Bytes/Sec.
Compression: 1.9X
HTML Whitespace: 0.0%
Bytes Transmitted: 9,029 Bytes
HTML Transfer Rate: 4,053,491 Bytes/Sec.
Resolve Domain Name: 0.107 Sec.
Connect Time: 0.107
Transfer Time: 0.004 Sec.
Generate HTML: 0.195 Sec.
Total Time: 0.307 Sec.

Then I run these 4 Validation Tools with curl.

W3C HTML Markup
W3C CSS Validation Service
W3C mobileOK Checker
Web Page Test
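
The answer does not show how the tools are called. As a rough sketch of the HTML step only, the code below assumes the W3C Nu HTML Checker endpoint at https://validator.w3.org/nu/ with JSON output, which may not be the service the answer used. The function names, the user agent string, and the sample response are hypothetical.

```php
<?php
// Assumption: the W3C Nu HTML Checker and its JSON output format.
const CHECKER_URL = 'https://validator.w3.org/nu/?out=json';

// Posts the HTML to the checker; returns the raw JSON body or null on failure.
function request_validation(string $html): ?string
{
    $ch = curl_init(CHECKER_URL);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $html,
        CURLOPT_HTTPHEADER     => ['Content-Type: text/html; charset=utf-8'],
        CURLOPT_USERAGENT      => 'site-audit-tool/0.1', // hypothetical name
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);

    return is_string($body) ? $body : null;
}

// Keeps only the messages the checker marked as errors.
function extract_errors(string $json): array
{
    $data   = json_decode($json, true);
    $errors = [];
    foreach ($data['messages'] ?? [] as $message) {
        if (($message['type'] ?? '') === 'error') {
            $errors[] = sprintf(
                'line %d: %s',
                $message['lastLine'] ?? 0,
                $message['message'] ?? ''
            );
        }
    }
    return $errors;
}

// Made-up response, used here so the parsing can be tried without a network call.
$sample = '{"messages":[{"type":"error","lastLine":3,"message":"Unclosed element p."},'
        . '{"type":"info","message":"note"}]}';
print_r(extract_errors($sample));
```

In a real run the string returned by request_validation() would be passed to extract_errors(); a public checker should be queried sparingly, or a locally hosted copy used for bulk scans.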

Then report the results like:

No CSS Errors
No HTML Errors
W3C mobileOK 96%
Page Speed Score: 99%

answered Jan 20, 2015 at 20:45 by Misunderstood, edited Jan 20, 2015 at 21:08

2 Comments

David Webb Over a year ago

I actually have built the spell check and link checker already using curl to check the http_code, and curl with the built in enchant library to check the words on the page. I just need to figure out the validation side of it.

jeroen Over a year ago

How does this answer the question exactly?
