Extract two strings from HTML code

Question

I have an HTML table which has the following structure:

<tr>
    <td class='tablesortcolumn'>atest</td>
    <td >Kunde</td>
    <td ><a href="">[email protected]</a></td>
    <td align="right"><a href="module/dns_reseller/user_edit.php?ns=3&uid=6952"><img src="images/iconedit.gif" border="0"/></a> <img src="images/pixel.gif" width="2" height="1" border="0"/> <a href="module/dns_reseller/user.php?delete=true&uid=6952" onclick="return confirm('Möchten Sie den Datensatz wirklich löschen?');"><img src="images/icontrash.gif" border="0"/></a></td>
</tr>

There are hundreds of these tr blocks.

I want to extract atest and [email protected]

I tried the following:

$document = new DOMDocument();
$document->loadHTML($data);
$selector = new DOMXPath($document);
$elements = $selector->query("//*[contains(@class, 'tablesortcolumn')]");

foreach($elements as $element) {
  $text = $element->nodeValue;
  print($text);
  print('<br>');
}

Extracting atest is no problem, because I can get the element with the tablesortcolumn class. How can I get the email address?

I cannot simply use //table/tr/td/a because other elements on the website are structured the same way. So I need to select the link by its empty href attribute. I already tried //table/tr/td/a[contains(@href, '')], but it returns the same result as //table/tr/td/a.

Does anyone have an idea how to solve this?
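A note on the attempted expression: in XPath 1.0, contains(s, '') is true for any string s, so the predicate filters nothing. An exact comparison against the empty string is probably what was intended. The sketch below uses a hypothetical two-row table and a placeholder address (user@example.com), neither of which comes from the original page.

```php
<?php
// Hypothetical sample: one link with an empty href, one with a real target.
$html = '<table>
  <tr><td><a href="">user@example.com</a></td></tr>
  <tr><td><a href="page.php">Other link</a></td></tr>
</table>';

$document = new DOMDocument();
$document->loadHTML($html);
$selector = new DOMXPath($document);

// Every string contains the empty string, so both links match.
$all = $selector->query("//table/tr/td/a[contains(@href, '')]");

// An exact comparison matches only the link whose href is empty.
$empty = $selector->query("//table/tr/td/a[@href = '']");

echo $all->length, "\n";               // 2
echo $empty->length, "\n";             // 1
echo $empty->item(0)->nodeValue, "\n"; // user@example.com
```

Whether an empty href is unique enough on the real page is something only the full markup can confirm.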

An XPath axis like following-sibling could perhaps have helped you with that, too, if the email TD is always two TDs after the "atest" TD. Just saying.
– hakre, Commented Apr 15, 2015 at 18:11
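The following-sibling idea from this comment could look roughly like the sketch below. It assumes the column order shown in the question (name cell, one cell in between, then the email cell); the row and the address user@example.com are placeholders.

```php
<?php
// Hypothetical row modelled on the question; the address is a placeholder.
$html = '<table><tr>
  <td class="tablesortcolumn">atest</td>
  <td>Kunde</td>
  <td><a href="">user@example.com</a></td>
</tr></table>';

$document = new DOMDocument();
$document->loadHTML($html);
$selector = new DOMXPath($document);

// From each name cell, take the second following <td> sibling and its link.
$query = "//td[@class = 'tablesortcolumn']/following-sibling::td[2]/a";

$emails = [];
foreach ($selector->query($query) as $link) {
    $emails[] = $link->nodeValue;
}

print_r($emails);
```

This depends purely on position, so it would break if a column were added or removed between the two cells.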

nomistic · Accepted Answer · 2015-04-14 15:21:29Z

2

Can you try an XPath expression that looks for text containing the @ character? It seems unlikely that this character would be used for anything else on the page.

So something like this might work:

//*[text()[contains(.,'@')]]

edited Apr 14, 2015 at 15:21

answered Apr 14, 2015 at 15:15

nomistic


5 Comments

Vince Over a year ago

That works! Thank you. Now, how can I combine the atest with [email protected]? Is there something like an OR condition?

nomistic Over a year ago

Yes, just use or, like so: contains(@class, 'tablesortcolumn') or contains(etc....)

nomistic Over a year ago

I find that if you are dealing with structured data, it's a lot easier than regex

Mathias Müller Over a year ago

"if you are dealing with structured data, it's a lot easier than regex": that's a statement that is too general to be useful. Also, //*[text()[contains(.,'@')]] is unwieldy; please change it to //*[contains(text(),'@')]

nomistic Over a year ago

Good point, and yes, I like yours better; I've just seen problems with that form when there's another nested node, as commonly occurs in HTML, such as a line break. Also, re: regex... I just said that because it's a personal opinion, and also largely because regex makes me dizzy ;)
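The nested-node caveat can be reproduced with a small sketch. In XPath 1.0, passing the node-set text() to contains() converts it to a string using only its first node, whereas text()[contains(., '@')] tests each text node on its own. The cell below is hypothetical and the address is a placeholder.

```php
<?php
// Hypothetical cell: the <br> splits the content into two text nodes.
$html = '<table><tr><td>Contact<br>user@example.com</td></tr></table>';

$document = new DOMDocument();
$document->loadHTML($html);
$selector = new DOMXPath($document);

// Tests each child text node separately, so the second one matches.
$perNode = $selector->query("//*[text()[contains(., '@')]]");

// Only the first text node ("Contact") is considered, so nothing matches.
$firstOnly = $selector->query("//*[contains(text(), '@')]");

echo $perNode->length, "\n";   // 1
echo $firstOnly->length, "\n"; // 0
```

For the table in the question the link holds a single text node, so both forms return the same element there.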

Mathias Müller · Accepted Answer · 2015-04-14 15:23:10Z

1

The following XPath expression does exactly what you want:

//*[@class = 'tablesortcolumn' or contains(text(),'@')]

Applied to the input document you have shown, it yields (individual results separated by a dashed line):

<td class="tablesortcolumn">atest</td>
-----------------------
<a href="">[email protected]</a>
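With hundreds of rows, the combined expression returns names and addresses as one flat list. One way to keep each name together with its address is to select the rows first and then run relative queries against each row. This is only a sketch: the two rows and both addresses are placeholders, and the $pairs array is a name introduced here for illustration.

```php
<?php
// Hypothetical rows modelled on the question; addresses are placeholders.
$html = '<table>
  <tr><td class="tablesortcolumn">atest</td><td>Kunde</td><td><a href="">user@example.com</a></td></tr>
  <tr><td class="tablesortcolumn">btest</td><td>Kunde</td><td><a href="">other@example.com</a></td></tr>
</table>';

$document = new DOMDocument();
$document->loadHTML($html);
$selector = new DOMXPath($document);

$pairs = [];
// Visit only rows that have a name cell, then query relative to each row.
foreach ($selector->query("//tr[td[@class = 'tablesortcolumn']]") as $row) {
    $name  = $selector->evaluate("string(td[@class = 'tablesortcolumn'])", $row);
    $email = $selector->evaluate("string(td/a[contains(text(), '@')])", $row);
    $pairs[$name] = $email;
}

print_r($pairs);
```

Passing $row as the second argument to evaluate() makes it the context node, so the paths inside the loop only see cells of that row.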

answered Apr 14, 2015 at 15:23

Mathias Müller


1 Comment

nomistic Over a year ago

This is a better answer than mine for this situation

richerlariviere · Accepted Answer · 2015-04-14 15:52:45Z

If you are looking for an email field, you could use a regex. Here is an article that could be useful.

EDIT

As suggested by Nisse Engström, I am including the relevant part of the article here in case the blog goes down. Thanks for the advice.

// Suppress XML parsing errors (this is needed to parse Wikipedia's XHTML)
libxml_use_internal_errors(true);

// Load the PHP Wikipedia article
$domDoc = new DOMDocument();
$domDoc->load('http://en.wikipedia.org/wiki/PHP');

// Create XPath object and register the XHTML namespace
$xPath = new DOMXPath($domDoc);
$xPath->registerNamespace('html', 'http://www.w3.org/1999/xhtml');

// Register the PHP namespace if you want to call PHP functions
$xPath->registerNamespace('php', 'http://php.net/xpath');

// Register preg_match to be available in XPath queries 
//
// You can also pass an array to register multiple functions, or call 
// registerPhpFunctions() with no parameters to register all PHP functions
$xPath->registerPhpFunctions('preg_match');

// Find all external links in the article  
$regex = '@^http://[^/]+(?<!wikipedia.org)/@';
$links = $xPath->query("//html:a[ php:functionString('preg_match', '$regex', @href) > 0 ]");

// Print out matched entries
echo "Found " . (int) $links->length . " external links\n\n";
foreach($links as $linkDom) { /* @var $linkDom DOMElement */
    $link = simplexml_import_dom($linkDom);
    $desc = (string) $link;
    $href = (string) $link['href'];

    echo " - ";
    if ($desc && $desc != $href) {
        echo "$desc: ";
    } 
    echo "$href\n";
}
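The article's example targets external links, but the same mechanism might be adapted to email addresses along these lines. This is a sketch under assumptions: the pattern is deliberately loose and not a full address validator, the table row and user@example.com are placeholders, and no html: prefix is used because elements parsed with loadHTML() are not in a namespace.

```php
<?php
// Hypothetical row; the second link contains "@" but is not an address.
$html = '<table><tr>
  <td class="tablesortcolumn">atest</td>
  <td><a href="">user@example.com</a></td>
  <td><a href="module/x.php">not an address @ all</a></td>
</tr></table>';

libxml_use_internal_errors(true);
$domDoc = new DOMDocument();
$domDoc->loadHTML($html);

$xPath = new DOMXPath($domDoc);
$xPath->registerNamespace('php', 'http://php.net/xpath');
$xPath->registerPhpFunctions('preg_match');

// Loose pattern: something@something.tld, with no whitespace anywhere.
$regex = '/^[^@\s]+@[^@\s]+\.[^@\s]+$/';
$links = $xPath->query("//a[php:functionString('preg_match', '$regex', .) > 0]");

$emails = [];
foreach ($links as $link) {
    $emails[] = $link->nodeValue;
}

print_r($emails);
```

Compared with contains(text(), '@'), the regex rejects link texts that merely contain an @ character, at the cost of a PHP callback per candidate node.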

Dr. Z · Accepted Answer · 2015-04-14 15:18:46Z

0

If you are using Chrome, you can test your XPath queries in the console, like this:

$x("//*[contains(@class, 'tablesortcolumn')]")

answered Apr 14, 2015 at 15:18

Dr. Z


2 Comments

Mathias Müller Over a year ago

This answer describes a way to test XPath expressions, but it does not answer the question. By the way: also works in Firefox.

Dr. Z Over a year ago

I know, but this is a tip. Vince is using the right method to do this; he just needs the right query.
