
I have a number of sites on a server that multiple employees work on daily. With so many people handling these files, it can be hard to keep up with certain things. I am building a PHP-based tool that will scan these pages for broken links, spelling errors, and similar problems.

Right now I am working on the HTML validation portion, namely detecting missing or extra opening and closing tags. I found a post on here (PHP Based HTML Validator) that led me to a PEAR package that ties in with W3C validation.

I have not tried it yet, as the last released version is almost 5 years old. Can anyone point me in the right direction?

1
  • While the last version "released" through PEAR is almost 5 years old, if you take a look at the project's GitHub repository you'll see plenty of recent activity. Commented Jan 20, 2015 at 20:33

2 Answers

2

PHP has a built-in function for this job, libxml_get_errors(), which returns an array of the errors libxml recorded while parsing. Take a look at its documentation, which also includes an example.

My test with page body:

<?php

libxml_use_internal_errors(true);

$xmlstr = <<< XML
    <body>
        <h1>Correct tag</h1>
        <h2>Tag not closed</h2>
        <p>Missing end of paragraph
        <br>
        <script type="text/javascript">
        var test = "Script";
        </script>
        <img src="some.url" alt="Image title" >
        <footer>Some error in footer?<footer>
    </body>
XML;

$doc = simplexml_load_string($xmlstr);
$xml = explode("\n", $xmlstr);

if (!$doc) {
    $errors = array_reverse ( libxml_get_errors() );
    echo "<pre>";
    foreach ($errors as $error) {
        echo display_xml_error($error, $xml);
    }
    echo "</pre>";
    libxml_clear_errors();
}


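// Format one libxml error as text: the offending source line, a caret
// under the reported column, the severity with its code, and the message.
function display_xml_error($error, $xml)
{
    // libxml line numbers are 1-based and can point past the last array entry.
    $index   = $error->line - 1;
    $return  = (isset($xml[$index]) ? $xml[$index] : '') . "\n";
    $return .= str_repeat('-', $error->column) . "^\n";

    switch ($error->level) {
        case LIBXML_ERR_WARNING:
            $return .= "Warning $error->code: ";
            break;
        case LIBXML_ERR_ERROR:
            $return .= "Error $error->code: ";
            break;
        case LIBXML_ERR_FATAL:
            $return .= "Fatal Error $error->code: ";
            break;
    }

    $return .= trim($error->message);

    // The file name is only set when the document was loaded from a file.
    if ($error->file) {
        $return .= "\n  File: $error->file";
    }

    return "$return\n\n--------------------------------------------\n\n";
}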

?>

This results in:

---------^
Fatal Error 77: Premature end of data in tag body line 1

--------------------------------------------

    
---------^
Fatal Error 77: Premature end of data in tag p line 4

--------------------------------------------

    
---------^
Fatal Error 77: Premature end of data in tag br line 5

--------------------------------------------

    
---------^
Fatal Error 77: Premature end of data in tag img line 9

--------------------------------------------

    
---------^
Fatal Error 77: Premature end of data in tag footer line 10

--------------------------------------------

    
---------^
Fatal Error 76: Opening and ending tag mismatch: footer line 10 and body

--------------------------------------------

Do not be confused by the error about body not being closed. When the markup is well formed, no errors are reported. For example, the following code produces no errors according to libxml_get_errors():

<body>
    <h1>Correct tag</h1>
    <h2>Tag closed</h2>
    <p>Not missing end of paragraph</p>
    <br />
    <script type="text/javascript">
    var test = "Script";
    </script>
    <img src="some.url" alt="Image title" />
    <div class="somediv">
        <p>Paragraph nested</p>
        <ul>
            <li>List element</li>
            <li>List element</li>
        </ul>
    </div>
    <footer>No error in footer</footer>
</body>
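
The test above parses the page as XML, so ordinary HTML such as an unclosed <br> is reported as an error. If that is too strict, one possible variation is to parse with DOMDocument::loadHTML(), which uses libxml's error-recovering HTML parser and feeds the same libxml_get_errors() list. The sketch below is only an illustration under that assumption: collect_html_errors() is a hypothetical helper name, and the HTML parser in many libxml builds predates HTML5, so it may flag newer elements such as footer as unknown tags. It is a parser, not a full validator.

```php
<?php
// Sketch only: collect_html_errors() is a hypothetical helper, not part of PHP.
// It parses markup with libxml's HTML parser and returns the recorded errors.
function collect_html_errors($html)
{
    // Capture libxml errors internally instead of emitting PHP warnings,
    // remembering the previous setting so it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $errors = array();
    foreach (libxml_get_errors() as $error) {
        $errors[] = array(
            'line'    => $error->line,
            'column'  => $error->column,
            'code'    => $error->code,
            'message' => trim($error->message),
        );
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}

// Example input with an unclosed <br> and an implicitly closed <p>.
$html = "<html><body>\n<p>First paragraph<br>\n<p>Second paragraph</p>\n</body></html>";

foreach (collect_html_errors($html) as $e) {
    echo "Line {$e['line']}, column {$e['column']}: {$e['message']}\n";
}
```

Which messages appear depends on the installed libxml version, so it is worth running both parsers on a few real pages before deciding which warnings to ignore.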

4 Comments

This gave me what I need to get started on investigating and tweaking this to meet my needs. Thank you!
HTML should be well formed, so for broken tags it works fine. Take a look at the small test I made above.
Well, yes, but for example <p>This is a paragraph.<p>This is another paragraph</p> is valid, well-formed HTML5, as HTML5 doesn't follow the XML rules. Paragraphs can be considered closed when a new paragraph comes along, and that's perfectly valid according to the spec.
It is important that he gets all the warnings; he can ignore the ones for paragraph, list, body, and break tags...
-1

I have a validation app that runs multiple validations.

First I check the page for basic parameters and server configuration.

Parameters that are in range print out green, those out of range print red, and those in between print yellow.

Web Server Configuration
Base Page HTML
Compression enabled: Yes
Cache specified: Yes max-age=31536000, public
Expiration specified: Yes
Keep Alive Specified: Yes
Characterset Specified: Yes
Web Server Performance Metrics
Base Page HTML
Base Page Size: 17,126 Bytes
Transmission Speed: 2,137,041 Bytes/Sec.
Compression: 1.9X
HTML Whitespace: 0.0%
Bytes Transmitted: 9,029 Bytes
HTML Transfer Rate: 4,053,491 Bytes/Sec.
Resolve Domain Name: 0.107 Sec.
Connect Time: 0.107
Transfer Time: 0.004 Sec.
Generate HTML: 0.195 Sec.
Total Time: 0.307 Sec.

Then I run these four validation tools with curl:

  • W3C HTML Markup
  • W3C CSS Validation Service
  • W3C mobileOK Checker
  • Web Page Test

Then report the results like:

  • No CSS Errors
  • No HTML Errors
  • W3C mobileOK 96%
  • Page Speed Score: 99%
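
As a rough illustration of the "run a validator with curl, then report" step, here is a minimal sketch. It assumes the W3C Nu HTML Checker endpoint at https://validator.w3.org/nu/ with out=json, which may not be the same service this answer used, and both function names (summarize_messages() and check_html_with_nu()) are hypothetical helpers introduced for the example. The JSON handling is kept separate from the HTTP call so it can be tried without network access.

```php
<?php
// Sketch only: both function names below are hypothetical helpers.

// Count errors and warnings in a Nu HTML Checker JSON response.
// Errors have type "error"; warnings are type "info" with subType "warning".
function summarize_messages($json)
{
    $summary = array('errors' => 0, 'warnings' => 0);
    $data = json_decode($json, true);

    if (!is_array($data) || !isset($data['messages'])) {
        return $summary; // Not the expected response shape.
    }

    foreach ($data['messages'] as $m) {
        $type    = isset($m['type']) ? $m['type'] : '';
        $subType = isset($m['subType']) ? $m['subType'] : '';

        if ($type === 'error') {
            $summary['errors']++;
        } elseif ($type === 'info' && $subType === 'warning') {
            $summary['warnings']++;
        }
    }

    return $summary;
}

// POST the page source to the checker and return the summary,
// or null if the request itself failed.
function check_html_with_nu($html, $endpoint = 'https://validator.w3.org/nu/?out=json')
{
    $ch = curl_init($endpoint);
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $html,
        CURLOPT_HTTPHEADER     => array('Content-Type: text/html; charset=utf-8'),
        CURLOPT_USERAGENT      => 'site-checker-sketch/0.1', // placeholder agent string
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30,
    ));

    $body = curl_exec($ch);

    return $body === false ? null : summarize_messages($body);
}
```

A report line in the style above could then be produced with something like `echo $summary['errors'] === 0 ? "No HTML Errors" : "{$summary['errors']} HTML Errors";`. For many pages, running a local copy of the checker avoids sending heavy traffic to the public service.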

2 Comments

I actually have built the spell check and link checker already using curl to check the http_code, and curl with the built in enchant library to check the words on the page. I just need to figure out the validation side of it.
How does this answer the question exactly?
