2

I am using simple html dom parser to parse some html.

I have an html like this

<span class="UIStory_Message">
    Yeah, elixir of life!<br/>
   <a href="asdfasdf">
      <span>asdfsdfasdfsdf</span>
       <wbr/>
       <span class="word_break"/>
       61193133389&ref=nf
   </a>
</span>

My code is

$storyMessageNodes    = $story->find('span.UIStory_Message');
$storyMessage         = strip_tags($storyMessageNodest->innertext);

I want to get the text right inside the span "UIStory_Message". ie, "Yeah, elixir of life!".

but the above code gives me the whole text which is inside the whole span. ie, "Yeah, elixir of life! asdfsdfasdfsdf 61193133389&ref=nf "

how could i code so that it gives only "Yeah, elixir of life!" ??

3 Answers 3

5

I've written a method to get rid of unneeded elements in fetched DOM nodes, I've contacted the author, but simple dom has not been active for two years so I doubt he will include it in the distribution. Here it is:

/**
 * remove specified nodes from selected dom
 *
 * @param string $selector
 * @param int|array (optional) possible values include:
 *   + positive integer - remove first denoted number of elements
 *   + negative integer - remove last denoted number of elements
 *   + array of ones and zeroes - remove the respective matches that equal to one
 *
 * eg.
 *   // will remove first two images found in node
 *   $dom->removeNodes('img',2);
 *
 *   // will remove last two images found in node
 *   $dom->removeNodes('img',-2);
 *
 *   // will remove all but the third images found in node
 *   $dom->removeNodes('img',array(1,1,0,1));
 *
 * [!!!] if there are more matches found than elements in array, the last array member will be used for processing
 *
 * eg.
 *   // will remove second and every following image
 *   $dom->removeNodes('img',array(0,1));
 *
 *   // will remove only the second image
 *   $dom->removeNodes('img',array(0,1,0));
 *
 * @return simple_html_dom_node
 */
public function removeNodes($selector, $limit = NULL)
{
    $elements = $this->find($selector);
    if ( empty($elements) ) return $this;


    if ( isset($limit) && is_int( $limit ) && $limit < 0 ) {
        $limit = abs( $limit );
        $elements = array_reverse( $elements );
    }

    foreach ( $elements as $element ) {

        if ( isset($limit) ) {

            if ( is_array( $limit ) ) {
                $current = current( $limit );
                if ( next( $limit ) === FALSE ) {
                    end( $limit );
                }
                if ( !$current ) {
                    continue;
                }
            } else {
                if ( --$limit === -1 ) {
                    return $this;
                }
            }
        }

        $element->outertext = '';

    }

    return $this;
}

put it in simple_html_dom_node class or one extending it. In the askers case you'd use it like this:

$storyMessageNodes = $story->find('span.UIStory_Message');
$storyMessage = $storyMessageNodes[0]->removeNodes('a')->plaintext
Sign up to request clarification or add additional context in comments.

1 Comment

How can I get this function to remove the whole element including the innertext of the element not just the element tags?
1

You can do something like this:

$result = $story->find('span.UIStory_Message');

And then substr() on the first <; one other option is to write a simple regular expression.


I've not tested, this is just a wild guess based on the documentation, try doing:

$story->find('span.UIStory_Message')->plaintext; // same result as strip_tags()?

Or:

$story->find('span.UIStory_Message')->find('text');

If that doesn't work, try playing with these options.

1 Comment

I know tht will work.... but iwant to know if there is any direct methods in simple_html_dom.php for doing this??
0

when you only delete the outer text you delete the HTML content itself, but if you perform another find on the same elements it will appear in the result. the reason is that the simple HTML DOM object still has it's internal structure of the element, only without its actual content. what you need to do in order to really delete the element is simply reload the HTML as string to the same variable. this way the object will be recreated without the deleted content, and the simple HTML DOM object will be built without it.

here is an example function:

public function removeNode($selector)
{
    foreach ($html->find($selector) as $node)
    {
        $node->outertext = '';
    }

    $this->load($this->save());        
}

put this function inside the simple_html_dom class and you're good.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.