I have an HTML table whose rows have the following structure:

<tr>
    <td class='tablesortcolumn'>atest</td>
    <td >Kunde</td>
    <td ><a href="">[email protected]</a></td>
    <td align="right"><a href="module/dns_reseller/user_edit.php?ns=3&uid=6952"><img src="images/iconedit.gif" border="0"/></a> <img src="images/pixel.gif" width="2" height="1" border="0"/> <a href="module/dns_reseller/user.php?delete=true&uid=6952" onclick="return confirm('Möchten Sie den Datensatz wirklich löschen?');"><img src="images/icontrash.gif" border="0"/></a></td>
</tr>

There are hundreds of these tr blocks.

I want to extract atest and [email protected]

I tried the following:

$document = new DOMDocument();
$document->loadHTML($data);
$selector = new DOMXPath($document);
$elements = $selector->query("//*[contains(@class, 'tablesortcolumn')]");

foreach($elements as $element) {
  $text = $element->nodeValue;
  print($text);
  print('<br>');
}

Extracting atest is no problem, because I can get the element with the tablesortcolumn class. How can I get the email address?

I cannot simply use //table/tr/td/a because other elements on the page are structured the same way. So I need to select the link by its empty href attribute. I already tried //table/tr/td/a[contains(@href, '')], but it returns the same nodes as //table/tr/td/a.

Does anyone have an idea how to solve this?
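
A note on the contains(@href, '') attempt: in XPath 1.0 every string contains the empty string, so that predicate is true for every link. Testing for equality with @href = '' is one way to pick out only the links whose href is empty. The sketch below assumes a cut-down version of the row from the question; the address user@example.com and the "edit" link text are placeholders, not real data.

```php
<?php
// Hypothetical sample: one data row with an empty-href link and a normal link.
$data = <<<HTML
<table>
  <tr>
    <td class="tablesortcolumn">atest</td>
    <td>Kunde</td>
    <td><a href="">user@example.com</a></td>
    <td><a href="module/dns_reseller/user_edit.php?uid=6952">edit</a></td>
  </tr>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$document = new DOMDocument();
$document->loadHTML($data);
$selector = new DOMXPath($document);

// contains(@href, '') is true for every href, so this matches both links.
$all = $selector->query("//table/tr/td/a[contains(@href, '')]");

// An equality test matches only the link whose href is the empty string.
$empty = $selector->query("//table/tr/td/a[@href = '']");

$emptyTexts = [];
foreach ($empty as $a) {
    $emptyTexts[] = $a->nodeValue;
}

echo $all->length, " vs ", $empty->length, "\n";
echo implode(", ", $emptyTexts), "\n";
```

Whether an empty href is a reliable marker depends on the rest of the page; if other links also have empty hrefs, the predicate would need a further condition.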

1
  • An XPath axis like following-sibling could perhaps have helped you with that, too, if the email TD is always two TDs after the "atest" TD. Just saying. Commented Apr 15, 2015 at 18:11
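
The following-sibling idea from the comment could look roughly like this. It is only a sketch and assumes, as the comment does, that the email cell is always the second td after the tablesortcolumn cell. The second row, both addresses and the "edit" link text are made up for illustration.

```php
<?php
// Hypothetical sample rows; the addresses are placeholders, not real data.
$data = <<<HTML
<table>
  <tr>
    <td class="tablesortcolumn">atest</td>
    <td>Kunde</td>
    <td><a href="">user@example.com</a></td>
    <td><a href="module/dns_reseller/user_edit.php?uid=6952">edit</a></td>
  </tr>
  <tr>
    <td class="tablesortcolumn">btest</td>
    <td>Kunde</td>
    <td><a href="">other@example.com</a></td>
    <td><a href="module/dns_reseller/user_edit.php?uid=6953">edit</a></td>
  </tr>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$document = new DOMDocument();
$document->loadHTML($data);
$selector = new DOMXPath($document);

$rows = [];
foreach ($selector->query("//td[contains(@class, 'tablesortcolumn')]") as $nameCell) {
    // Relative query from the name cell: second following <td>, then its link.
    $link = $selector->query("following-sibling::td[2]/a", $nameCell)->item(0);
    $rows[] = [
        'name'  => trim($nameCell->nodeValue),
        'email' => $link !== null ? trim($link->nodeValue) : null,
    ];
}

print_r($rows);
```

Because each lookup starts from its own name cell, the name and the address stay paired per row even if a row has no link.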

4 Answers 4

2

Can you try an XPath expression that matches text containing the @ character? It seems unlikely that @ would appear in any other text on the page.

So something like this might work:

//*[text()[contains(.,'@')]]

5 Comments

That works! Thank you. Now, how can I combine the atest with [email protected]? Is there something like an OR condition?
Yes, just use or. Like so: contains(@class, 'tablesortcolumn') or contains(etc....)
I find that if you are dealing with structured data, it's a lot easier than regex
"If you are dealing with structured data, it's a lot easier than regex": that statement is too general to be useful. Also, //*[text()[contains(.,'@')]] is unwieldy; please change it to //*[contains(text(),'@')]
Good point, and yes, I like yours better. I've just seen problems with that form when there's another nested node, such as a line break, which is common in HTML. Also, re: regex, I said that only because it's my personal opinion, and largely because regex makes me dizzy ;)
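
To illustrate the nested-node case mentioned in the last comment: in XPath 1.0, contains(text(), '@') converts the text() node-set to a string by taking only its first node, while text()[contains(., '@')] tests each text node separately. The snippet below is a made-up example (the "Contact:" label and the address are placeholders) where the two forms give different results.

```php
<?php
// Hypothetical snippet: the address sits in the SECOND text node of the link,
// because a <br> splits the link's text in two.
$data = '<p><a href="">Contact:<br>user@example.com</a></p>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$document = new DOMDocument();
$document->loadHTML($data);
$selector = new DOMXPath($document);

// Only the first text node ("Contact:") is examined here, so nothing matches.
$short = $selector->query("//*[contains(text(), '@')]");

// Every text node child is tested on its own, so the link is found.
$long = $selector->query("//*[text()[contains(., '@')]]");

echo $short->length, " vs ", $long->length, "\n";
```

For the table in the question, where each link holds a single text node, both forms should behave the same.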
1

The following XPath expression does exactly what you want:

//*[@class = 'tablesortcolumn' or contains(text(),'@')]

Applied to the input document you have shown, it yields (individual results separated by a dashed line):

<td class="tablesortcolumn">atest</td>
-----------------------
<a href="">[email protected]</a>
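
For completeness, here is how the combined expression might be run from PHP with the DOMXPath setup used in the question. This is a sketch only: the second row and both addresses are placeholders, and the pairing step assumes that every row contributes exactly one name followed by one address.

```php
<?php
// Hypothetical sample rows; the addresses are placeholders, not real data.
$data = <<<HTML
<table>
  <tr>
    <td class="tablesortcolumn">atest</td>
    <td>Kunde</td>
    <td><a href="">user@example.com</a></td>
  </tr>
  <tr>
    <td class="tablesortcolumn">btest</td>
    <td>Kunde</td>
    <td><a href="">other@example.com</a></td>
  </tr>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$document = new DOMDocument();
$document->loadHTML($data);
$selector = new DOMXPath($document);

// One query for both kinds of node; results come back in document order.
$query = "//*[@class = 'tablesortcolumn' or contains(text(), '@')]";

$values = [];
foreach ($selector->query($query) as $node) {
    $values[] = trim($node->nodeValue);
}

// Assumption: the values alternate name, address, name, address, ...
$pairs = array_chunk($values, 2);

print_r($pairs);
```

If some rows can lack an address, chunking by position would misalign the pairs, and a per-row lookup would be safer.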

1 Comment

This is a better answer than mine for this situation
1

If you are looking for an email field, you could use a regex. Here is an article that could be useful.

EDIT

As Nisse Engström suggested, I am copying the interesting part of the article here in case the blog goes down. Thanks for the advice.

// Suppress XML parsing errors (this is needed to parse Wikipedia's XHTML)
libxml_use_internal_errors(true);

// Load the PHP Wikipedia article
$domDoc = new DOMDocument();
$domDoc->load('http://en.wikipedia.org/wiki/PHP');

// Create XPath object and register the XHTML namespace
$xPath = new DOMXPath($domDoc);
$xPath->registerNamespace('html', 'http://www.w3.org/1999/xhtml');

// Register the PHP namespace if you want to call PHP functions
$xPath->registerNamespace('php', 'http://php.net/xpath');

// Register preg_match to be available in XPath queries 
//
// You can also pass an array to register multiple functions, or call 
// registerPhpFunctions() with no parameters to register all PHP functions
$xPath->registerPhpFunctions('preg_match');

// Find all external links in the article  
$regex = '@^http://[^/]+(?<!wikipedia.org)/@';
$links = $xPath->query("//html:a[ php:functionString('preg_match', '$regex', @href) > 0 ]");

// Print out matched entries
echo "Found " . (int) $links->length . " external links\n\n";
foreach($links as $linkDom) { /* @var $linkDom DOMElement */
    $link = simplexml_import_dom($linkDom);
    $desc = (string) $link;
    $href = (string) $link['href'];

    echo " - ";
    if ($desc && $desc != $href) {
        echo "$desc: ";
    } 
    echo "$href\n";
}
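
Applied to the email case from the question, the same preg_match technique might look like the sketch below. The pattern is deliberately loose and only for illustration, not a real address validator; the sample row, the address and the "edit" link text are placeholders.

```php
<?php
// Hypothetical sample row; the address is a placeholder, not real data.
$data = <<<HTML
<table>
  <tr>
    <td class="tablesortcolumn">atest</td>
    <td><a href="">user@example.com</a></td>
    <td><a href="module/dns_reseller/user_edit.php?uid=6952">edit</a></td>
  </tr>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$document = new DOMDocument();
$document->loadHTML($data);

$selector = new DOMXPath($document);
// Expose preg_match to XPath through the php: namespace.
$selector->registerNamespace('php', 'http://php.net/xpath');
$selector->registerPhpFunctions('preg_match');

// Loose illustrative pattern: something@something.something, no whitespace.
// It must not contain a single quote, since it is embedded in an XPath literal.
$regex = '/^[^@\s]+@[^@\s]+\.[^@\s]+$/';
$links = $selector->query(
    "//a[php:functionString('preg_match', '$regex', string(.)) > 0]"
);

$matches = [];
foreach ($links as $link) {
    $matches[] = $link->nodeValue;
}

print_r($matches);
```

Compared with contains(text(), '@'), this costs a PHP callback per candidate node, so it is probably only worth it when a plain @ test produces false positives.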

0

If you are using Chrome, you can test your XPath queries in the console, like this:

$x("//*[contains(@class, 'tablesortcolumn')]")

2 Comments

This answer describes a way to test XPath expressions, but it does not answer the question. By the way: also works in Firefox.
I know, but this is just a tip. Vince is using the right method; he just needs the right query.
