2

I'm trying to parse a block from an HTML page, so I tried to match it with preg_match in PHP:

if( preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t)) 

but it doesn't work. The HTML looks like this:

</div>

blablabla

blablabla

blablabla

<div class="adsdiv">

I want to grab only the blablabla blablabla words. Any help?

1
  • Please describe exactly what HTML text you would like to match. Commented Jul 21, 2010 at 10:38

4 Answers

1

Regex isn't the right tool for this. Here is how to do it with DOM:

$html = <<< HTML
<div class="parent">
    <div>
        <p>previous div</p>
    </div>
    blablabla
    blablabla
    blablabla
    <div class="adsdiv">
        <p>other content</p>
    </div>
</div>
HTML;

Content in an HTML document is stored in text nodes (DOMText); tags are element nodes (DOMElement). The text node that holds blablabla must have a parent node. To fetch its value, we will assume you want all the text nodes that are direct children of the parent of the div whose class attribute is adsdiv:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$nodes = $xPath->query('//div[@class="adsdiv"]');
foreach($nodes as $node) {
    foreach($node->parentNode->childNodes as $child) {
        if($child instanceof DOMText) {
            echo $child->nodeValue;
        }
    }
}

Yes, it's not a funky one-liner, but it's also much less of a headache and gives you solid control over the HTML document. Using the query power of XPath, we could have shortened the above to

$nodes = $xPath->query('//div[@class="adsdiv"]/../text()');
foreach($nodes as $node) {
    echo $node->nodeValue;
}

I kept it deliberately verbose to illustrate how to use DOM though.
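Note that the text nodes also carry the newlines and indentation around the words. As a rough sketch, the XPath query above could be wrapped in a helper that returns only the non-empty lines. The function name textAroundAdsDiv and the use of libxml_use_internal_errors() are my additions for illustration, not part of the code above:

```php
<?php
// Hypothetical helper: collect the non-empty text lines that are direct
// children of the parent of the div with class "adsdiv".
function textAroundAdsDiv(string $html): array
{
    $dom = new DOMDocument;
    // Keep libxml quiet about imperfect real-world markup.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xPath = new DOMXPath($dom);
    $lines = [];
    foreach ($xPath->query('//div[@class="adsdiv"]/../text()') as $node) {
        // A single text node can span several lines; split and trim each one.
        foreach (preg_split('/\R/', $node->nodeValue) as $line) {
            $line = trim($line);
            if ($line !== '') {
                $lines[] = $line;
            }
        }
    }
    return $lines;
}

$html = <<<HTML
<div class="parent">
    <div>
        <p>previous div</p>
    </div>
    blablabla
    blablabla
    blablabla
    <div class="adsdiv">
        <p>other content</p>
    </div>
</div>
HTML;

echo implode(' ', textAroundAdsDiv($html)), "\n";
```

Whitespace-only text nodes (the indentation between tags) are dropped by the trim and empty-string check, so only the actual words remain.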


1

Apart from what has been said above, also add the /s modifier so . will match newlines. (edit: as Alan kindly pointed out, [^<]+ will match newlines anyway)

I always use /U as well, since in these cases you normally want minimal matching by default (it will be faster as well). And /i, since people write <div>, <DIV>, or even <Div>...

if (preg_match('/<\/div>([^<]+)<div class="adsdiv">/Usi', $data, $match))
{
    echo "Found: ".$match[1]."<br>";
} else {
    echo "Not found<br>";
}

Edit: made it a little more explicit!

4 Comments

Thanks for the reply, mvds, but it returns an empty result, meaning it doesn't work.
Ok I added a little code which shows how to get the matched portion out of it. This should work (although, it requires that the input is exactly what you are showing; i.e. not some formatted html by firefox-like "view source"!)
[^<] will match newlines whether you use the /s modifier or not.
And I recommend NOT getting in the habit of using the /U modifier. It's better to get out of the habit of using .*. Reluctant quantifiers speed up matching by avoiding excessive backtracking, but you already took care of that by using [^<]+ instead of .*. If anything, the /U is slowing you down, because character-for-character, reluctant quantifiers are slower than greedy ones.
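To make the /U point in these comments concrete, here is a small sketch (the sample string 'aaa' and the variable names are only for illustration). /U inverts the default greediness of every quantifier in the pattern, so a trailing ? then makes a quantifier greedy again:

```php
<?php
// Without /U, + is greedy and takes as many characters as it can.
preg_match('/a+/', 'aaa', $greedy);

// With /U, the same + becomes reluctant and stops after one character.
preg_match('/a+/U', 'aaa', $ungreedy);

// With /U, adding ? flips the quantifier back to greedy.
preg_match('/a+?/U', 'aaa', $flipped);

echo $greedy[0], ' ', $ungreedy[0], ' ', $flipped[0], "\n"; // prints "aaa a aaa"
```

In the answer's pattern the [^<]+ is bounded by literal text on both sides, so the captured text should be the same with or without /U; only the matching strategy differs.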
0

From the PHP Manual:

s (PCRE_DOTALL) - If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

So, the following should work:

if (preg_match('~<\/div>(.*?)<div class="adsdiv">~s', $data, $t))

The ~ are there to delimit the regular expression.
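For completeness, a minimal runnable sketch of that pattern against input shaped like the question's. The $data string and the whitespace splitting are my assumptions about what the OP is after:

```php
<?php
// Sample input modeled on the question; the real $data would be the page source.
$data = "</div>\n\nblablabla\n\nblablabla\n\nblablabla\n\n<div class=\"adsdiv\">";

$words = [];
if (preg_match('~</div>(.*?)<div class="adsdiv">~s', $data, $t)) {
    // $t[1] holds the captured text, including the surrounding newlines,
    // so trim it and split on runs of whitespace.
    $words = preg_split('/\s+/', trim($t[1]));
    echo implode(' ', $words), "\n"; // prints "blablabla blablabla blablabla"
} else {
    echo "Not found\n";
}
```

With ~ as the delimiter, the / in </div> no longer needs escaping, although the escaped form is harmless.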


-1

You need to delimit your regex; use /<\/div>(.*?)<div class="adsdiv">/ instead.
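A quick sketch of why the original fails silently inside an if. As far as I can tell, PHP treats the leading < and the first > as a bracket-style delimiter pair, so the rest of the string is parsed as modifiers and preg_match() returns false with a warning. The $data sample is made up, and the /s modifier is added here only because the sample spans several lines:

```php
<?php
$data = "</div>\nblablabla\n<div class=\"adsdiv\">";

// The question's pattern as written. The @ hides the warning for this demo only.
$broken = @preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t);
var_dump($broken === false); // bool(true): an error, not merely "no match"

// The same pattern with explicit / delimiters.
$fixed = preg_match('/<\/div>(.*?)<div class="adsdiv">/s', $data, $t);
var_dump($fixed); // int(1)
```

Comparing the return value with === false distinguishes a pattern error from a pattern that simply did not match (which returns 0).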

1 Comment

Although it doesn't solve the OP's problem, this is a valid point. The regex in the question lacks delimiters, so preg_match will emit a warning and return false if you try to use it.
