I'm trying to extract values from the table row below. I tried curl with a regex (I know it's not recommended) and a DOM parser separately, but neither returned the values correctly.

There are multiple rows in the page, so I'll need to use a foreach. I need an exact match of the structure below.

<tr>
    <td width="75" style="NS">
        <img src="NS" width="64" alt="INEEDTHISVALUE">
    </td>
    <td style="NS">
        <a href="NS">NS</a>
    </td>
    <td style="NS">INEEDTHISVALUETOO</td>
</tr>

NS = non-static values. They change for each td and a element because the table is colored with inline CSS. They may contain special characters such as ; and /, as well as digits and letters.

I'm using the simple_html_dom class, which can be found here: http://htmlparsing.com/php.html

The code below gets every td on the page, but I need more specific output: only the two values marked in the table row above.

What I've tried so far:

$html = file_get_html("URL");
foreach($html->find('td') as $td) {
    echo $td."<br>";
}

REGEX & CURL

$site = "URL";
$ch = curl_init();
$hc = "YahooSeeker-Testing/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; Yahoo! Search - Web Search)";
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com');
curl_setopt($ch, CURLOPT_URL, $site);
curl_setopt($ch, CURLOPT_USERAGENT, $hc);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$site = curl_exec($ch);
curl_close($ch);
preg_match_all('@<tr><td width="75" style="(.*?)"><img src="/folder/link/(.*?)" width="64" alt="(.*?)"></td><td style="(.*?)"><a href="/folder2/link2/(.*?)">(.*?)</a></td><td style="(.*?)">(.*?)</td></tr>@', $site, $arr);
var_dump($arr); // returns empty array, WHY?

1 Answer

You can do it like this without a library:

$results = array();
$doc = new DOMDocument();
$doc->loadHTML($site);
$xpath = new DOMXPath($doc);

foreach ($xpath->query('//tr') as $tr) {
    $results[] = array(
        'img_alt' => $xpath->query('td[1]/img', $tr)->item(0)->getAttribute('alt'),
        'td_text' => $xpath->query('td[last()]', $tr)->item(0)->nodeValue
    );
}

print_r($results);

It will give you:

Array
(
    [0] => Array
        (
            [img_alt] => INEEDTHISVALUE 1
            [td_text] => INEEDTHISVALUETOO 1
        )

    [1] => Array
        (
            [img_alt] => INEEDTHISVALUE 2
            [td_text] => INEEDTHISVALUETOO 2
        )

)
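
Real pages are rarely as clean as the sample row. They can contain tags that libxml does not recognize, and rows that have no image in the first cell. In that case loadHTML() emits warnings and item(0) returns null. The following is a minimal sketch of a more defensive variant, not a definitive implementation. The function name extractRows and the $sample markup are made up for illustration. The HTML string could just as well come from the curl_exec() call in the question or from file_get_contents().

```php
<?php
// Hypothetical helper (not part of the code above): returns the image alt text
// and the last cell's text for every row that actually has an <img> in its first cell.
function extractRows($html)
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $results = array();

    // The predicate skips rows without td[1]/img, so item(0) is never null below.
    foreach ($xpath->query('//tr[td[1]/img]') as $tr) {
        $img  = $xpath->query('td[1]/img', $tr)->item(0);
        $last = $xpath->query('td[last()]', $tr)->item(0);
        $results[] = array(
            'img_alt' => $img->getAttribute('alt'),
            'td_text' => trim($last->nodeValue),
        );
    }
    return $results;
}

// Made-up sample: one row with an unknown tag and no image, one row in the expected shape.
$sample = '<table><tr><td><g:plusone></g:plusone></td></tr>'
        . '<tr><td width="75"><img src="a.png" width="64" alt="INEEDTHISVALUE"></td>'
        . '<td><a href="#">x</a></td><td>INEEDTHISVALUETOO</td></tr></table>';

print_r(extractRows($sample));
```

Filtering in the XPath expression keeps the loop body free of null checks. If rows without an image should be reported rather than skipped, query '//tr' instead and test each item(0) result before using it.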

Relevant documentation: PHP: DOMXPath::query
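
As for why the regex attempt returns an empty array: one likely cause, assuming the real page is indented like the row shown in the question, is that the pattern expects the tags to be directly adjacent (`<tr><td ...`), while the markup has newlines and spaces between them. The sketch below compares a strict pattern with one that allows whitespace between tags. The attribute values in $row are invented sample data, and both patterns are shortened illustrations rather than a recommendation to parse HTML with regular expressions.

```php
<?php
// Sample row laid out like the one in the question; attribute values are made up.
$row = <<<HTML
<tr>
    <td width="75" style="color:red;">
        <img src="/folder/link/a.png" width="64" alt="INEEDTHISVALUE">
    </td>
    <td style="color:blue;">
        <a href="/folder2/link2/x">label</a>
    </td>
    <td style="color:green;">INEEDTHISVALUETOO</td>
</tr>
HTML;

// Strict style: tags must touch each other, so the indented markup never matches.
$strict = '@<tr><td width="75" style="(.*?)"><img src="(.*?)" width="64" alt="(.*?)"></td>@';
$strictCount = preg_match_all($strict, $row, $m1);

// Tolerant style: \s* allows whitespace between tags, and [^"]* keeps each
// attribute match inside its own quotes.
$loose = '@<tr>\s*<td width="75" style="[^"]*">\s*<img src="[^"]*" width="64" alt="([^"]*)">\s*</td>'
       . '\s*<td style="[^"]*">\s*<a href="[^"]*">.*?</a>\s*</td>'
       . '\s*<td style="[^"]*">(.*?)</td>\s*</tr>@s';
$looseCount = preg_match_all($loose, $row, $m2, PREG_SET_ORDER);

var_dump($strictCount, $looseCount);
```

Even the tolerant pattern breaks as soon as attribute order or quoting changes, which is why the DOM approach is the more robust choice here.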

2 Comments

It works, thank you. But I can't load an external HTML file that way; I'll look into the documentation to do that. Thanks!
But it doesn't, I think it's being broken by the HTML file. I get an error like this:
Notice: DOMDocument::loadHTML(): Namespace prefix g is not defined in Entity, line: 167 in /Applications/MAMP/htdocs/fetch/test.php on line 153
Warning: DOMDocument::loadHTML(): Tag g:plusone invalid in Entity, line: 167 in /Applications/MAMP/htdocs/fetch/test.php on line 153
Fatal error: Call to a member function getAttribute() on null in /Applications/MAMP/htdocs/fetch/test.php on line 158
