I am using the PHP Simple HTML DOM Parser, but it does not seem to have a way to search for text. I need to search for a string and find the id of its parent element. Essentially the reverse of normal usage.
Does anyone know how?
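$html = file_get_html('http://www.google.com/');
// '*' selects every element in the document
$eles = $html->find('*');
foreach ($eles as $e) {
    // innertext includes the markup of all descendants, so every ancestor
    // of the matching element (html, body, ...) passes this test as well
    if (strpos($e->innertext, 'theString') !== false) {
        echo $e->id;
    }
}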
Just imagine that every tag has a "plaintext" attribute and use standard attribute selectors.
So, HTML:
<div id="div1">
<span>London is the capital</span> of Great Britain
</div>
<div id="div2">
<span>Washington is the capital</span> of the USA
</div>
can be thought of as:
<div id="div1" plaintext="London is the capital of Great Britain">
<span plaintext="London is the capital ">London is the capital</span> of Great Britain
</div>
<div id="div2" plaintext="Washington is the capital of the USA">
<span plaintext="Washington is the capital ">Washington is the capital</span> of the USA
</div>
And the PHP to solve your task is just:
<?php
$t = '
<div id="div1">
<span>London is the capital</span> of Great Britain
</div>
<div id="div2">
<span>Washington is the capital</span> of the USA
</div>';
$html = str_get_html($t);
$foo = $html->find('span[plaintext^=London]');
echo "ID: " . $foo[0]->parent()->id; // div1
?>
(Keep in mind that "plaintext" for <span> tags is right-padded with a space character; this is the default behaviour of Simple HTML DOM, defined by the constant DEFAULT_SPAN_TEXT.)
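$d = new DOMDocument();
$d->loadXML($xml); // $xml holds the markup to search
$x = new DOMXPath($d);
// ids of all ancestor elements of the text nodes that contain the string
$result = $x->evaluate("//text()[contains(.,'617.99')]/ancestor::*/@id");
$unique = null;
// iterate from the last result backwards and stop at the first id that occurs exactly once
for ($i = $result->length - 1; $i >= 0 && $item = $result->item($i); $i--) {
    if ($x->query("//*[@id='" . addslashes($item->value) . "']")->length == 1) {
        echo 'Unique ID is ' . $item->value . "\n";
        $unique = $item->value;
        break;
    }
}
if (is_null($unique)) echo 'no unique ID found';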
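For HTML input rather than XML, the same lookup could be sketched roughly as follows. The markup, the ids and the search string are made up for illustration, and the snippet assumes the search string contains no single quote, since it is pasted directly into the XPath expression.

```php
<?php
// Hypothetical sample markup; the ids and prices are invented for this sketch.
$html = '<html><body>
<div id="wrapper">
  <span id="price1">617.99</span>
  <span id="price2">12.50</span>
</div>
</body></html>';

// Collect libxml parse warnings internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$x = new DOMXPath($d);

// Assumption: $needle contains no single quote.
$needle = '617.99';

// parent::* selects only the element that directly contains the text node.
$ids = [];
foreach ($x->query("//text()[contains(., '$needle')]/parent::*/@id") as $attr) {
    $ids[] = $attr->value;
}

echo implode(', ', $ids), "\n";
```

With this sample markup the script prints price1. Using ancestor::* instead of parent::* would also return the id of the enclosing div.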
Note that this uses DOMDocument, not the SimpleHTMLDom library the OP said he was using.
For malformed input you can set $d->recover = true; and $d->strictErrorChecking = false;, and of course use loadHTML() instead of loadXML() for HTML. If you still get too many errors that you cannot ignore (never display errors on production sites), you could set libxml_use_internal_errors(true); to handle them separately from other PHP errors.
The wrapper is not what we want :). My bad, my XPath is a bit rusty; try //text()[contains(.,'617.99')]/parent::*/@id, which seems to work here.
The warnings can be silenced either with @ (@$d->loadHTML($html);, which is kinda evil) or with libxml_use_internal_errors(true); $d->loadHTML($html); libxml_clear_errors(); (preferred IMHO). An id should be unique, but we all know it sometimes is not. You can check with $x->query("//*[@id='theid']")->length == 1 (for priceIncTaxSpan3047 it is, but look at the 50 Table_01's, no wonder DOMDocument protests :)
Got the answer. The entire example is a little long, but it works. I also show the output.
The HTML for what we are going to look at:
<html>
<head>
<title>Simple HTML DOM - Find Text</title>
</head>
<body>
<h3>Simple HTML DOM - Find Text</h3>
<div id="first">
<p>This is a paragraph inside of div 'first'.
This paragraph does not have the text we are looking for.</p>
<p>As a matter of fact this div does not have the text we are looking for</p>
</div>
<div id="second">
<ul>
<li>This is an unordered list.
<li id="love1">We are looking for the following word love.
<li>Does not contain the word.
</ul>
<p id="love2">This paragraph which is in div second contains the word love.</p>
</div>
<div id="third">
<a id="love3" href="goes.nowhere.com">link to love site</a>
</div>
</body>
</html>
The PHP:
<?php
include_once('simple_html_dom.php');

function scraping_for_text($iUrl, $iText)
{
    echo "iUrl=".$iUrl."<br />";
    echo "iText=".$iText."<br />";
    // create HTML DOM
    $html = file_get_html($iUrl);
    // get text elements
    $aObj = $html->find('text');
    if (count($aObj) > 0)
    {
        echo "<h4>Found ".$iText."</h4>";
    }
    else
    {
        echo "<h4>No ".$iText." found"."</h4>";
    }
    foreach ($aObj as $key=>$oLove)
    {
        $plaintext = $oLove->plaintext;
        if (strpos($plaintext, $iText) !== FALSE)
        {
            echo $key.": text=".$plaintext."<br />"
                ."--- parent tag=".$oLove->parent()->tag."<br />"
                ."--- parent id=".$oLove->parent()->id."<br />";
        }
    }
    // clean up memory
    $html->clear();
    unset($html);
    return;
}
// -------------------------------------------------------------
// test it!
// user_agent header...
ini_set('user_agent', 'My-Application/2.5');
scraping_for_text("test_text.htm", "love");
?>
The output:
iUrl=test_text.htm
iText=love
Found love
18: text=We are looking for the following word love.
--- parent tag=li
--- parent id=love1
21: text=This paragraph which is in div second contains the word love.
--- parent tag=p
--- parent id=love2
25: text=link to love site
--- parent tag=a
--- parent id=love3
That's all they wrote!!!!
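The output above works because each element that directly contains the text has its own id. When it does not, one option is to walk up the tree to the nearest ancestor that does. Below is a minimal sketch of that idea using PHP's built-in DOMDocument instead of Simple HTML DOM. The function names nearestAncestorId and findIdsByText and the sample markup are hypothetical, and the match is case-insensitive by choice.

```php
<?php
// Hypothetical helper: id of the nearest ancestor element that has a
// non-empty id attribute, or null when no ancestor carries one.
function nearestAncestorId(DOMNode $node): ?string
{
    for ($n = $node->parentNode; $n !== null; $n = $n->parentNode) {
        if ($n instanceof DOMElement && $n->getAttribute('id') !== '') {
            return $n->getAttribute('id');
        }
    }
    return null;
}

// Hypothetical helper: ids of the nearest id-bearing ancestors of all
// text nodes in <body> that contain $needle (case-insensitive).
function findIdsByText(string $html, string $needle): array
{
    // Keep libxml parse warnings out of the normal PHP error output.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $ids = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//body//text()') as $textNode) {
        if (stripos($textNode->nodeValue, $needle) !== false) {
            $id = nearestAncestorId($textNode);
            if ($id !== null) {
                $ids[] = $id;
            }
        }
    }
    // Drop duplicates and reindex the array.
    return array_values(array_unique($ids));
}

// Sample markup for illustration only; the spans have no id of their own.
$html = '<html><body>
<div id="div1"><span>London is the capital</span> of Great Britain</div>
<div id="div2"><span>Washington is the capital</span> of the USA</div>
</body></html>';

print_r(findIdsByText($html, 'london'));
```

Here the span that holds the text has no id, so the search falls back to the enclosing div and reports div1.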