How to get string from HTML with regex?

Question

I'm trying to parse a block from an HTML page, so I tried to match it with preg_match in PHP:

if( preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t))

But it doesn't work. The HTML looks like this:

</div>

blablabla

blablabla

blablabla

<div class="adsdiv">

I want to grab only the "blablabla blablabla" words. Any help?

Please describe exactly what HTML text you would like to match.
– ULysses, Commented Jul 21, 2010 at 10:38

Community · Accepted Answer · 2017-05-23 12:03:20Z

Regex isn't the right tool for this. Here is how to do it with DOM:

$html = <<< HTML
<div class="parent">
    <div>
        <p>previous div</p>
    </div>
    blablabla
    blablabla
    blablabla
    <div class="adsdiv">
        <p>other content</p>
    </div>
</div>
HTML;

Content in an HTML document is made up of text nodes; tags are element nodes. The text node containing blablabla must have a parent node. To fetch its value, we will assume you want all the text nodes that are children of the parent of the div whose class attribute is adsdiv:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$nodes = $xPath->query('//div[@class="adsdiv"]');
foreach($nodes as $node) {
    foreach($node->parentNode->childNodes as $child) {
        if($child instanceof DOMText) {
            echo $child->nodeValue;
        }
    }
}

Yes, it's not a funky one-liner, but it's also much less of a headache and gives you solid control over the HTML document. Using the query power of XPath, we could have shortened the above to:

$nodes = $xPath->query('//div[@class="adsdiv"]/../text()');
foreach($nodes as $node) {
    echo $node->nodeValue;
}

I kept it deliberately verbose to illustrate how to use DOM, though.
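If only the text sitting directly in front of the ads div is wanted, rather than every text node of the parent, the XPath query can be narrowed with the preceding-sibling axis. This is a minimal sketch, assuming markup shaped like the $html sample above; the function name textBeforeAdsDiv and the whitespace normalisation are illustrative additions, not part of the original answer.

```php
<?php
// Hypothetical helper: returns the text that appears as a sibling of, and
// before, any <div class="adsdiv">, with whitespace runs collapsed.
function textBeforeAdsDiv(string $html): string
{
    // Collect libxml parse errors internally instead of emitting warnings
    // for imperfect real-world HTML.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xPath = new DOMXPath($dom);

    $parts = [];
    $query = '//div[@class="adsdiv"]/preceding-sibling::text()';
    foreach ($xPath->query($query) as $node) {
        // Collapse line breaks and indentation into single spaces.
        $text = preg_replace('/\s+/', ' ', trim($node->nodeValue));
        if ($text !== '') {
            $parts[] = $text; // skip whitespace-only nodes between tags
        }
    }
    return implode(' ', $parts);
}
```

Calling echo textBeforeAdsDiv($html); with the sample document should print the three blablabla words on one line. Note that @class="adsdiv" only matches when the class attribute is exactly adsdiv; elements carrying several classes would need a contains()-style test instead.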

mvds · Accepted Answer · 2010-07-21 13:39:34Z

1

Apart from what has been said above, also add the /s modifier so that . will match newlines. (Edit: as Alan kindly pointed out, [^<]+ will match newlines anyway.)

I always use /U as well, since in these cases you normally want minimal matching by default (it will be faster as well). And /i, since people write <div>, <DIV>, or even <Div>...

if (preg_match('/<\/div>([^<]+)<div class="adsdiv">/Usi', $data, $match))
{
    echo "Found: ".$match[1]."<br>";
} else {
    echo "Not found<br>";
}

edit made it a little more explicit!
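To check the pattern against input shaped like the snippet in the question, something along these lines can be used. It is only a sketch: the $data string is a made-up stand-in for the real page, /U and /s are left out to show that [^<]+ crosses line breaks without them, and the whitespace clean-up is an extra step not present in the answer.

```php
<?php
// Assumed sample input, mirroring the layout shown in the question.
$data = "</div>\n\nblablabla\n\nblablabla\n\nblablabla\n\n<div class=\"adsdiv\">";

$text = null;
// No /s modifier here: the negated class [^<] matches newlines on its own.
if (preg_match('/<\/div>([^<]+)<div class="adsdiv">/i', $data, $match)) {
    // The capture still holds the blank lines; collapse them to spaces.
    $text = preg_replace('/\s+/', ' ', trim($match[1]));
    echo "Found: " . $text . "\n";
} else {
    echo "Not found\n";
}
```

If this prints "Not found" on the real page, the markup between the two tags probably contains further elements (any < stops [^<]+), or the adsdiv tag carries attributes other than the exact class="adsdiv".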

edited Jul 21, 2010 at 13:39

answered Jul 21, 2010 at 10:46


4 Comments

normand Over a year ago

Thanks mvds for the reply, but it returns an empty result, meaning it doesn't work.

mvds Over a year ago

OK, I added a little code that shows how to get the matched portion out of it. This should work (although it requires that the input is exactly what you are showing, i.e. not HTML reformatted by a Firefox-like "view source").

Alan Moore Over a year ago

[^<] will match newlines whether you use the /s modifier or not.

Alan Moore Over a year ago

And I recommend NOT getting in the habit of using the /U modifier. It's better to get out of the habit of using .*. Reluctant quantifiers speed up matching by avoiding excessive backtracking, but you already took care of that by using [^<]+ instead of .*. If anything, the /U is slowing you down, because character-for-character, reluctant quantifiers are slower than greedy ones.
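What /U actually does is easiest to see on a tiny input. The following sketch uses a made-up string $s; it shows that the modifier inverts the greediness of every quantifier, so a pattern that already uses .*? becomes greedy again under /U.

```php
<?php
// Hypothetical input with two candidate matches.
$s = '<b>one</b><b>two</b>';

// Default: .* is greedy and runs to the last </b>.
preg_match('/<b>.*<\/b>/', $s, $greedy);

// With /U: the same .* becomes reluctant and stops at the first </b>.
preg_match('/<b>.*<\/b>/U', $s, $ungreedy);

// With /U: an explicit .*? is flipped back to greedy.
preg_match('/<b>.*?<\/b>/U', $s, $flipped);

echo $greedy[0], "\n";   // <b>one</b><b>two</b>
echo $ungreedy[0], "\n"; // <b>one</b>
echo $flipped[0], "\n";  // <b>one</b><b>two</b>
```

This inversion is one practical reason to prefer explicit quantifiers or negated classes: a reader has to notice the trailing U before they can tell what any quantifier in the pattern means.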

Alix Axel · Accepted Answer · 2010-07-21 10:44:59Z

0

From the PHP Manual:

s (PCRE_DOTALL) - If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

So, the following should work:

if (preg_match('~<\/div>(.*?)<div class="adsdiv">~s', $data, $t))

The ~ are there to delimit the regular expression.
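The effect of the s modifier can be verified with a small comparison. This is a sketch with an assumed multi-line $data string; because ~ is the delimiter, the slash in </div> does not need escaping.

```php
<?php
// Assumed input: the wanted text is spread over several lines.
$data = "</div>\nblablabla\nblablabla\n<div class=\"adsdiv\">";

// Without /s the dot cannot step over "\n", so nothing matches.
$without = preg_match('~</div>(.*?)<div class="adsdiv">~', $data, $a);

// With /s the dot matches newlines too and the block is captured.
$with = preg_match('~</div>(.*?)<div class="adsdiv">~s', $data, $b);

var_dump($without);   // int(0)
var_dump($with);      // int(1)
echo trim($b[1]), "\n";
```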

answered Jul 21, 2010 at 10:44


user11977 · Accepted Answer · 2010-07-21 10:38:59Z

-1

You need to delimit your regex; use /<\/div>(.*?)<div class="adsdiv">/ instead.

answered Jul 21, 2010 at 10:38


1 Comment

Alan Moore Over a year ago

Although it doesn't solve the OP's problem, this is a valid point. The regex in the question lacks proper delimiters, so preg_match() will raise a warning and return false instead of matching.
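The effect of the missing delimiters can be reproduced in isolation. In this sketch the one-line $data is made up, and the @ operator is used only to keep the demonstration quiet. PHP takes the leading < as the opening delimiter and the first unescaped > as the closing one, so the rest of the pattern is read as modifiers and compilation fails; as far as the PHP manual describes it, that failure is reported as a warning and a false return value.

```php
<?php
// Hypothetical one-line input containing the wanted block.
$data = '</div>blablabla<div class="adsdiv">';

// Pattern from the question: "<" ... ">" is treated as a delimiter pair,
// leaving "(.*?)<div ..." to be parsed as (invalid) modifiers.
$broken = @preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t1);

// Same pattern wrapped in explicit / delimiters.
$fixed = preg_match('/<\/div>(.*?)<div class="adsdiv">/', $data, $t2);

var_dump($broken); // false: the pattern failed to compile
var_dump($fixed);  // int(1)
echo $t2[1], "\n"; // blablabla
```

Comparing the return value with === false, rather than relying on a plain if, makes it possible to tell a compilation failure apart from an ordinary non-match, which returns 0.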
