
I need to scrape the data from an HTML table and orientate the columnar data as rows of a 2d array.

My code does not display the correct structure.

HTML Table:

<html>
<head>
</head>
<body>
<table>
    <tbody>
        <tr>
            <td>header</td>
            <td>header</td>
            <td>header</td>
        </tr>
        <tr>
            <td>content</td>
            <td>content</td>
            <td>content</td>
        </tr>
        <tr>
            <td>test</td>
            <td>test</td>
            <td>test</td>
        </tr>
    </tbody>
</table>
</body>
</html>

PHP CODE:

$DOM = new \DOMDocument();
$DOM->loadHTML($valdat["table"]);
                
$Header = $DOM->getElementsByTagName('tr')->item(0)->getElementsByTagName('td');
$Detail = $DOM->getElementsByTagName('td');
      
//#Get header name of the table
foreach($Header as $NodeHeader) 
{
    $aDataTableHeaderHTML[] = trim($NodeHeader->textContent);
}
//print_r($aDataTableHeaderHTML); die();

//#Get row data/detail table without header name as key
$i = 0;
$j = 0;

foreach($Detail as $sNodeDetail) 
{
    $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
    $i = $i + 1;
    $j = $i % count($aDataTableHeaderHTML) == 0 ? $j + 1 : $j;
}
//print_r($aDataTableDetailHTML); die();

//#Get row data/detail table with header name as key and outer array index as row number
for($j = 0; $j < count($aDataTableHeaderHTML); $j++)
{
    for($i = 1; $i < count($aDataTableDetailHTML); $i++)
    {

        $aTempData[][$aDataTableHeaderHTML[$j]][] = $aDataTableDetailHTML[$i][$j];
    }
}

$aDataTableDetailHTML = $aTempData;
echo json_encode($aDataTableDetailHTML);

My result:

[{"header":["content"]},{"header":["test"]},{"header":["content"]},{"header":["test"]},{"header":["content"]},{"header":["test"]}]

Desired result:

[
   ["header","content","test"],
   ["header","content","test"],
   ["header","content","test"]
]

4 Answers


I've changed a lot of the code to (hopefully) simplify it. It works in two stages. The first extracts the <tr> elements and builds an array of the <td> values in each row, storing the results in $rows.

The second ties the data together vertically by looping across the first row and using array_column() to extract the corresponding value from every row...

$trList = $DOM->getElementsByTagName("tr");
$rows = [];
foreach ( $trList as $tr )  {
    $row = [];
    foreach ( $tr->getElementsByTagName("td") as $td )  {
        $row[] = trim($td->textContent);
    }
    $rows[] = $row;
}

$aDataTableDetailHTML = [];
foreach ( $rows[0] as $col => $value )  {
    $aDataTableDetailHTML[] = array_column($rows, $col);
}
echo json_encode($aDataTableDetailHTML);

Which with the test data gives...

[["header","content","test"],["header","content","test"],["header","content","test"]]
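As an alternative to the array_column() loop, the second stage can also be sketched with array_map() and a null callback, which zips its array arguments together. This is a minimal sketch, not part of the answer above: the $rows literal is sample data copied from the question's table, and the code assumes at least two rows.

```php
<?php
// Sample data matching the question's table; stands in for the $rows array
// built by the first stage.
$rows = [
    ['header', 'header', 'header'],
    ['content', 'content', 'content'],
    ['test', 'test', 'test'],
];

// Passing null as the callback makes array_map() zip its array arguments,
// so spreading $rows into it yields one output array per column.
// Assumption: at least two rows. With exactly one row, array_map(null, $row)
// returns that row unchanged instead of wrapping each cell in its own array.
$transposed = array_map(null, ...$rows);

echo json_encode($transposed), PHP_EOL;
// [["header","content","test"],["header","content","test"],["header","content","test"]]
```

If rows have different lengths, array_map() pads the shorter ones with null, so it is worth checking the row lengths first when the table may be ragged.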

I have added some extra code. It chunks the array into pairs of values (one pair per column) and then adds the key, in this case "header", to the front of each pair:

//There are two elements that are not "header"
$aDataTableDetailHTML = array_chunk($aTempData, 2);

//For every item in the array
foreach($aDataTableDetailHTML as $key=>$tag){
    //Dynamically get the name, in this case, "header"
    $tagName = array_keys( $tag[0] )[0];

    //Start an array containing the tagname ("header")
    $tagOut = array( $tagName );

    //Add the two values onto the array
    $tagOut[] = $tag[0][$tagName][0];
    $tagOut[] = $tag[1][$tagName][0];

    //Drop the keys from the array
    $aDataTableDetailHTML[$key] = array_values( $tagOut );
}

echo json_encode($aDataTableDetailHTML);

This gave me the output:

[ [ "header", "content", "test" ], [ "header", "content", "test" ], [ "header", "content", "test" ] ]

Which seems to match what you were after. Hope that this helps.

I also tested some additional values, and the pattern continued to carry.

1 Comment

This does not work with simpler tables. Example: 1 2 3 - header 1 2 3 - body
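One way to handle the case raised in this comment is to stop hard-coding the chunk size of 2 and derive it from the number of body rows. The following is a rough sketch under assumptions: the $aTempData literal is hypothetical data in the shape the question's loops would produce for one header row and one body row, and $bodyRowCount is a made-up variable name.

```php
<?php
// Hypothetical input: the shape produced by the question's loops for a table
// with one header row (h1, h2, h3) and one body row (a, b, c).
$aTempData = [
    ['h1' => ['a']],
    ['h2' => ['b']],
    ['h3' => ['c']],
];

// Assumption: the number of body rows is known. In the question's code it
// would be count($aDataTableDetailHTML) - 1, read before that variable is
// overwritten. It must be at least 1, because array_chunk() rejects size 0.
$bodyRowCount = 1;

$columns = [];
foreach (array_chunk($aTempData, $bodyRowCount) as $chunk) {
    // Each chunk holds all body cells of one column under the same header key.
    $name = array_key_first($chunk[0]); // requires PHP 7.3+
    $column = [$name];
    foreach ($chunk as $entry) {
        $column[] = $entry[$name][0];
    }
    $columns[] = $column;
}

echo json_encode($columns), PHP_EOL;
// [["h1","a"],["h2","b"],["h3","c"]]
```

With $bodyRowCount set to 2 and the question's original data, the same loop yields the three ["header","content","test"] rows.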

I know this answer comes late, but I have developed a package for this purpose called TableDude.

For your case this PHP snippet will work.


// Including TableDude
require __DIR__ . "/../src/autoload.php";

$html = "<html>
<head>
</head>
<body>
<table>
    <tbody>
        <tr>
            <td>header</td>
            <td>header</td>
            <td>header</td>
        </tr>
        <tr>
            <td>content</td>
            <td>content</td>
            <td>content</td>
        </tr>
        <tr>
            <td>test</td>
            <td>test</td>
            <td>test</td>
        </tr>
    </tbody>
</table>
</body>
</html>";

// Parses the HTML to array table
$simpleParser = new \TableDude\Parser\SimpleParser($html);
$parsedTables = $simpleParser->parseHTMLTables();

if(count($parsedTables) > 0)
{
    $firstTable = $parsedTables[0];
    $tableOrderedByColumn = \TableDude\Tools\ArrayTool::swapArray($firstTable);
    print_r($tableOrderedByColumn);
}

// This would output
/*
array(
   array("header", "content", "test"),
   array("header", "content", "test"),
   array("header", "content", "test")
)
*/


To maintain the parent child relationships between rows and cells, access td tags within the context of tr tags.

Transposing data structures is done by swapping the first level keys with the second level keys.

Code:

$dom = new DOMDocument();
$dom->loadHTML($html);
$result = [];
foreach ($dom->getElementsByTagName('tr') as $i => $row) {
    foreach ($row->getElementsByTagName('td') as $c => $cell) {
        $result[$c][$i] = $cell->nodeValue;
    }
}
var_export($result);
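
To see the key swap in isolation, here is a small sketch on a plain array with no DOM involved. The $rows values are hypothetical and deliberately distinct, and the second row is one cell short to show how this approach treats ragged input.

```php
<?php
// Hypothetical input: rows as they might be parsed from a table.
// The second row is missing its third cell.
$rows = [
    ['a1', 'b1', 'c1'],
    ['a2', 'b2'],
];

$transposed = [];
foreach ($rows as $rowIndex => $row) {
    foreach ($row as $colIndex => $value) {
        // Swap the key order: column index first, row index second.
        $transposed[$colIndex][$rowIndex] = $value;
    }
}

var_export($transposed);
```

Here $transposed holds ['a1', 'a2'], ['b1', 'b2'] and ['c1']: a missing cell simply leaves no entry, rather than being padded with null.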
