Html parsing using Simple Html dom parser

Question

I am using Simple HTML DOM Parser to parse some HTML.

I have HTML like this:

<span class="UIStory_Message">
    Yeah, elixir of life!<br/>
   <a href="asdfasdf">
      <span>asdfsdfasdfsdf</span>
       <wbr/>
       <span class="word_break"/>
       61193133389&ref=nf
   </a>
</span>

My code is

$storyMessageNodes    = $story->find('span.UIStory_Message');
$storyMessage         = strip_tags($storyMessageNodes[0]->innertext);

I want to get only the text directly inside the span "UIStory_Message", i.e. "Yeah, elixir of life!".

But the above code gives me all the text inside the whole span, i.e. "Yeah, elixir of life! asdfsdfasdfsdf 61193133389&ref=nf ".

How can I code this so that it gives only "Yeah, elixir of life!"?

raveren · Accepted Answer · 2010-09-22 14:17:15Z

I've written a method to get rid of unneeded elements in fetched DOM nodes. I've contacted the author, but Simple HTML DOM has not been active for two years, so I doubt he will include it in the distribution. Here it is:

/**
 * remove specified nodes from selected dom
 *
 * @param string $selector
 * @param int|array (optional) possible values include:
 *   + positive integer - remove first denoted number of elements
 *   + negative integer - remove last denoted number of elements
 *   + array of ones and zeroes - remove the respective matches that equal to one
 *
 * eg.
 *   // will remove first two images found in node
 *   $dom->removeNodes('img',2);
 *
 *   // will remove last two images found in node
 *   $dom->removeNodes('img',-2);
 *
 *   // will remove all but the third images found in node
 *   $dom->removeNodes('img',array(1,1,0,1));
 *
 * [!!!] if there are more matches found than elements in array, the last array member will be used for processing
 *
 * eg.
 *   // will remove second and every following image
 *   $dom->removeNodes('img',array(0,1));
 *
 *   // will remove only the second image
 *   $dom->removeNodes('img',array(0,1,0));
 *
 * @return simple_html_dom_node
 */
public function removeNodes($selector, $limit = NULL)
{
    $elements = $this->find($selector);
    if ( empty($elements) ) return $this;


    if ( isset($limit) && is_int( $limit ) && $limit < 0 ) {
        $limit = abs( $limit );
        $elements = array_reverse( $elements );
    }

    foreach ( $elements as $element ) {

        if ( isset($limit) ) {

            if ( is_array( $limit ) ) {
                $current = current( $limit );
                if ( next( $limit ) === FALSE ) {
                    end( $limit );
                }
                if ( !$current ) {
                    continue;
                }
            } else {
                if ( --$limit === -1 ) {
                    return $this;
                }
            }
        }

        $element->outertext = '';

    }

    return $this;
}

Put it in the simple_html_dom_node class or one extending it. In the asker's case you'd use it like this:

$storyMessageNodes = $story->find('span.UIStory_Message');
$storyMessage = $storyMessageNodes[0]->removeNodes('a')->plaintext;
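
The $limit rules described in the docblock can be checked without the library. The sketch below is not part of Simple HTML DOM: removalMask is a hypothetical helper that only mirrors the selection logic of removeNodes, returning for each match whether it would be removed.

```php
<?php
// Hypothetical helper: for $count matched elements, report which ones
// removeNodes() would blank out under a given $limit.
function removalMask(int $count, $limit = null): array
{
    if ($count === 0) {
        return [];
    }
    // No limit: every match is removed.
    if ($limit === null) {
        return array_fill(0, $count, true);
    }
    $mask = array_fill(0, $count, false);
    if (is_int($limit)) {
        // Positive: first N matches. Negative: last N matches.
        $n = min(abs($limit), $count);
        for ($i = 0; $i < $n; $i++) {
            $index = $limit < 0 ? $count - 1 - $i : $i;
            $mask[$index] = true;
        }
        return $mask;
    }
    // Array of ones and zeroes: the last member is reused for extra matches.
    $flags = array_values($limit);
    if ($flags === []) {
        return $mask;
    }
    $last = count($flags) - 1;
    for ($i = 0; $i < $count; $i++) {
        $mask[$i] = (bool) $flags[min($i, $last)];
    }
    return $mask;
}

// Five images with array(0,1): the second and every following one is removed.
echo json_encode(removalMask(5, [0, 1])), "\n";
```

With array(0,1,0) the trailing zero is what stops the removal after the second match, since that zero is the value reused for all later matches.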

How can I get this function to remove the whole element including the innertext of the element not just the element tags?

Alix Axel · Accepted Answer · 2009-12-24 06:14:12Z


You can do something like this:

$result = $story->find('span.UIStory_Message');

Then use substr() to cut the string at the first <. Another option is to write a simple regular expression.
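
A minimal sketch of the substr() idea, in plain PHP. leadingText is a hypothetical helper name, and the sample string stands in for what innertext would return for the asker's span. It only helps when the wanted text comes before the first child tag.

```php
<?php
// Hypothetical helper: return the text that precedes the first tag
// in an HTML fragment, or the whole fragment if it contains no tag.
function leadingText(string $innerHtml): string
{
    $pos = strpos($innerHtml, '<');
    $text = ($pos === false) ? $innerHtml : substr($innerHtml, 0, $pos);
    // Decode entities such as &amp; and strip surrounding whitespace.
    return trim(html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8'));
}

// Stand-in for $storyMessageNodes[0]->innertext from the question.
$inner = "\n    Yeah, elixir of life!<br/>\n   <a href=\"asdfasdf\"><span>asdfsdfasdfsdf</span></a>\n";
echo leadingText($inner), "\n";
```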

I haven't tested this; it is just a guess based on the documentation. Note that find() returns an array unless you pass an index, so ask for the first match explicitly. Try:

$story->find('span.UIStory_Message', 0)->plaintext; // same result as strip_tags()?

Or:

$story->find('span.UIStory_Message', 0)->find('text');

If that doesn't work, try playing with these options.
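
If stepping outside the library is acceptable, the "direct text only" result can also be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM. This is a swapped-in technique, not something the library provides. directText is a hypothetical helper, and the class name is assumed to be a plain token that is safe to embed in an XPath expression.

```php
<?php
// Sketch using PHP's DOM extension, not Simple HTML DOM.
// Returns only the text nodes that are direct children of the
// first-level match for span.$class, joined by single spaces.
function directText(string $html, string $class): string
{
    $doc = new DOMDocument();
    // Collect parser warnings (e.g. for <wbr> or a bare &) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // text() selects direct text-node children only, not nested text.
    $query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' "
        . $class . " ')]/text()";

    $parts = [];
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode(' ', $parts);
}

$html = <<<'HTML'
<span class="UIStory_Message">
    Yeah, elixir of life!<br/>
   <a href="asdfasdf">
      <span>asdfsdfasdfsdf</span>
       <wbr/>
       <span class="word_break"/>
       61193133389&ref=nf
   </a>
</span>
HTML;

echo directText($html, 'UIStory_Message'), "\n";
```

Unlike cutting at the first <, this also picks up direct text that appears after a child element.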

edited Dec 24, 2009 at 6:14

answered Dec 24, 2009 at 5:39


1 Comment

Andromeda Over a year ago

I know that will work, but I want to know whether there are any direct methods in simple_html_dom.php for doing this.

Dr. Reshef · Accepted Answer · 2012-07-19 07:00:35Z

When you only clear the outertext, you delete the HTML content itself, but if you perform another find on the same elements they will still appear in the result. The reason is that the Simple HTML DOM object keeps its internal structure for the element, only without its actual content. To really delete the element, reload the HTML as a string into the same object. That way the Simple HTML DOM object is rebuilt without the deleted content.

Here is an example function:

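public function removeNode($selector)
{
    // Blank out every match; $this is the simple_html_dom object.
    foreach ($this->find($selector) as $node)
    {
        $node->outertext = '';
    }

    // Re-parse the saved HTML so the removed nodes leave the internal tree.
    $this->load($this->save());
}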

Put this function inside the simple_html_dom class and you're good.
