PHP Simple HTML DOM Parser find string

Question

I am using PHP Simple HTML DOM Parser, but it does not seem to have a way to search for text. I need to search for a string and find the id of the parent element that contains it. Essentially the reverse of normal usage.

Anyone know how?

karim79 · Accepted Answer · 2011-03-28 22:21:23Z

9

$html = file_get_html('http://www.google.com/');

$eles = $html->find('*');
foreach($eles as $e) {
    if(strpos($e->innertext, 'theString') !== false) {
        echo $e->id;
    }
}

http://simplehtmldom.sourceforge.net/manual.htm

answered Mar 28, 2011 at 22:21


2 Comments

karim79 Over a year ago

$e->id is the Simple DOM way to get the ID attribute. Perhaps try changing $eles = $html->find('*'); to $eles = $html->find('p, div'); or something.

Charlie Over a year ago

is it not getAttribute('id') ... I can't get it to work regardless :S

wake-spb · Accepted Answer · 2015-07-05 17:04:16Z

6

Just imagine that every tag has a "plaintext" attribute and use standard attribute selectors.

So, HTML:

<div id="div1">
  <span>London is the capital</span> of Great Britain
</div>
<div id="div2">
  <span>Washington is the capital</span> of the USA
</div>

can be thought of as:

<div id="div1" plaintext="London is the capital  of Great Britain">
  <span plaintext="London is the capital ">London is the capital</span> of Great Britain
</div>
<div id="div2" plaintext="Washington is the capital  of the USA">
  <span plaintext="Washington is the capital ">Washington is the capital</span> of the USA
</div>

And the PHP that solves your task is just:

<?php
  $t = '
    <div id="div1">
      <span>London is the capital</span> of Great Britain
    </div>
    <div id="div2">
      <span>Washington is the capital</span> of the USA
    </div>';
  $html = str_get_html($t);
  $foo = $html->find('span[plaintext^=London]');
  echo "ID: " . $foo[0]->parent()->id; // div1
?>

(Keep in mind that "plaintext" for <span> tags is right-padded with a space character; this is the default behaviour of Simple HTML DOM, defined by the constant DEFAULT_SPAN_TEXT.)
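For comparison, here is a rough sketch of the same lookup using PHP's bundled DOM extension instead of Simple HTML DOM. This is a swapped-in technique, not part of the library discussed above: it assumes DOMDocument and DOMXPath are available, and it uses XPath's starts-with() where the selector above uses ^=.

```php
<?php
// Assumption: PHP's bundled DOM extension (DOMDocument, DOMXPath) is enabled.
$t = '
  <div id="div1">
    <span>London is the capital</span> of Great Britain
  </div>
  <div id="div2">
    <span>Washington is the capital</span> of the USA
  </div>';

$doc = new DOMDocument();
$doc->loadHTML($t);
$xpath = new DOMXPath($doc);

// starts-with() plays the role of the ^= operator; ".." steps up to the parent element.
$parents = $xpath->query("//span[starts-with(normalize-space(.), 'London')]/..");
$id = $parents->length > 0 ? $parents->item(0)->getAttribute('id') : null;

echo "ID: " . $id . "\n"; // ID: div1
```

Since normalize-space() trims the text before comparing, the trailing-space padding mentioned above does not come into play in this variant.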

edited Jul 5, 2015 at 17:04

answered Jul 5, 2015 at 16:35


1 Comment

electroid Over a year ago

so far the best answer

Wrikken · Accepted Answer · 2011-03-28 23:53:19Z

3

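// $xml holds the markup to search (for HTML input, use loadHTML() instead)
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
// id attributes of every ancestor of a text node containing the string
$result = $x->evaluate("//text()[contains(.,'617.99')]/ancestor::*/@id");
$unique = null;
// walk backwards so the last (for a single hit, the innermost) ancestor is checked first
for($i = $result->length -1;$i >= 0 && $item = $result->item($i);$i--){
    // accept the id only if exactly one element carries it
    if($x->query("//*[@id='".addslashes($item->value)."']")->length == 1){
        echo 'Unique ID is '.$item->value."\n";
        $unique = $item->value;
        break;
    }
}
if(is_null($unique)) echo 'no unique ID found';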
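To see what the ancestor:: axis returns, and how it differs from the parent:: axis suggested in the comments, here is a small self-contained sketch. The sample markup is made up for illustration, id_values() is a hypothetical helper, and loadHTML() is used so that an HTML fragment parses.

```php
<?php
// Made-up sample markup; only the nesting and the ids matter here.
$html = '<div id="outer"><p id="price">Now only 617.99</p></div>';

$d = new DOMDocument();
$d->loadHTML($html);
$x = new DOMXPath($d);

// Hypothetical helper: collect the values of the matched id attributes.
function id_values(DOMNodeList $attrs)
{
    $values = array();
    foreach ($attrs as $attr) {
        $values[] = $attr->value;
    }
    return $values;
}

// ancestor::* yields every enclosing element that has an id, outermost first.
$ancestorIds = id_values($x->query("//text()[contains(.,'617.99')]/ancestor::*/@id"));
// parent::* yields only the element directly wrapping the text.
$parentIds = id_values($x->query("//text()[contains(.,'617.99')]/parent::*/@id"));

echo implode(', ', $ancestorIds) . "\n"; // outer, price
echo implode(', ', $parentIds) . "\n";   // price
```

The results come back in document order, which is why the loop above starts at the end of the list to reach the innermost element first.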

edited Mar 28, 2011 at 23:53

answered Mar 28, 2011 at 22:19


9 Comments

jrn.ak Over a year ago

This is PHP's DOMDocument, not the SimpleHTMLDom Library as the OP stated he was using.

Wrikken Over a year ago

Ack, missed that. Still can't get my head around people using that slow, slow thingamajig, but you're right, this isn't the answer the OP is looking for then.

Wrikken Over a year ago

Sure there is: before loading, set $d->recover = true;$d->strictErrorChecking = false;, and of course, use loadHTML() instead of loadXML() for HTML. If you still get too many errors which you cannot ignore (never display errors on production sites), you could set libxml_use_internal_errors(true); to handle them separately from other PHP errors.

Wrikken Over a year ago

Ack, wrapper is not what we want :). My bad, my XPath is a bit rusty, try //text()[contains(.,'617.99')]/parent::*/@id, seems to work here.

Wrikken Over a year ago

Warnings can be disabled by either prepending @ (@$d->loadHTML($html);), which is kinda evil, or using libxml_use_internal_errors(true);$d->loadHTML($html);libxml_clear_errors(); (preferred IMHO). An id should be unique, but we all know it's sometimes not. You can check with $x->query("//*[@id='theid']")->length == 1 (for priceIncTaxSpan3047 it is, but look at the 50 Table_01's, no wonder DOMDocument protests :)
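Putting the last few comments together, a minimal sketch might look like the following. The markup is invented to include a duplicated id, and counting ids in PHP (rather than building one XPath query per id) is a choice made for this sketch so that no id value has to be quoted inside an XPath string.

```php
<?php
// Made-up markup with a duplicated id, to mimic a sloppy real-world page.
$html = '<p id="dup">617.99</p><p id="dup">other</p><span id="solo">617.99</span>';

libxml_use_internal_errors(true);  // collect parser complaints instead of emitting warnings
$d = new DOMDocument();
$d->loadHTML($html);
$problems = libxml_get_errors();   // inspect or log these if needed
libxml_clear_errors();

$x = new DOMXPath($d);

// Count how often each id occurs in the document.
$counts = array();
foreach ($x->query('//*[@id]') as $el) {
    $id = $el->getAttribute('id');
    $counts[$id] = isset($counts[$id]) ? $counts[$id] + 1 : 1;
}

// Keep only the ids of direct parents that occur exactly once.
$uniqueIds = array();
foreach ($x->query("//text()[contains(.,'617.99')]/parent::*/@id") as $attr) {
    if ($counts[$attr->value] === 1) {
        $uniqueIds[] = $attr->value;
    }
}

echo count($problems) . " parser message(s)\n";
echo 'Unique ids: ' . implode(', ', $uniqueIds) . "\n";
```

How many parser messages are collected depends on the libxml version, so the sketch only reports the count.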


akeane · Accepted Answer · 2011-07-01 23:29:51Z

Got the answer. The entire example is a little long but it works. I also show the output.

The HTML for what we are going to look at:

<html>
<head>
<title>Simple HTML DOM - Find Text</title>
</head>
<body>
<h3>Simple HTML DOM - Find Text</h3>
<div id="first">
 <p>This is a paragraph inside of div 'first'.
   This paragraph does not have the text we are looking for.</p>
 <p>As a matter of fact this div does not have the text we are looking for</p>
</div>
<div id="second">
 <ul>
  <li>This is an unordered list.
  <li id="love1">We are looking for the following word love.
  <li>Does not contain the word.
 </ul>
 <p id="love2">This paragraph which is in div second contains the word love.</p>
</div>
<div id="third">
 <a id="love3" href="goes.nowhere.com">link to love site</a>
</div>
</body>
</html>

The PHP:

<?php
include_once('simple_html_dom.php');

function scraping_for_text($iUrl,$iText)
{
    echo "iUrl=".$iUrl."<br />";
    echo "iText=".$iText."<br />";

    // create HTML DOM
    $html = file_get_html($iUrl);

    // get text elements that contain the search text
    $aMatches = array();
    foreach ($html->find('text') as $key=>$oText)
    {
      if (strpos($oText->plaintext,$iText) !== FALSE)
      {
         $aMatches[$key] = $oText;
      }
    }
    if (count($aMatches) > 0)
    {
       echo "<h4>Found ".$iText."</h4>";
    }
    else
    {
       echo "<h4>No ".$iText." found"."</h4>";
    }
    foreach ($aMatches as $key=>$oLove)
    {
      echo $key.": text=".$oLove->plaintext."<br />"
           ."--- parent tag=".$oLove->parent()->tag."<br />"
           ."--- parent id=".$oLove->parent()->id."<br />";
    }

    // clean up memory
    $html->clear();
    unset($html);

    return;
}

// -------------------------------------------------------------
// test it!

// user_agent header...
ini_set('user_agent', 'My-Application/2.5');

scraping_for_text("test_text.htm","love");
?>

The output:

iUrl=test_text.htm
iText=love
Found love
18: text=We are looking for the following word love.
--- parent tag=li
--- parent id=love1
21: text=This paragraph which is in div second contains the word love.
--- parent tag=p
--- parent id=love2
25: text=link to love site
--- parent tag=a
--- parent id=love3

That's all they wrote!!!!

Great example. Would you know how to go from text, back to an element? I want to search by text and then find the nearest element. It's from an old table layout without any classes or IDs.
