PHP - extract data from a web page HTML

Question

I need to extract the text FIESTA ERASMUS and the path /event/83318 from the following HTML code:

    <div id="tab-soiree" class=""><div class="soireeagenda cat_1">  
            <a href="/event/83318/" class="lienFly"><img src="http://www.parisbouge.com/img/fly/resize/100/83318.jpg" alt="fiesta erasmus" class="fly"></a>
                <ul>
                    <li class="nom"><h2><a href="/event/83318/">FIESTA ERASMUS</a> </h2></li>
                    <li class="genre" style="margin-bottom:4px;">
                    <a href="/soirees-etudiantes/paris/1/" style="color:inherit;" title="soirée étudiante">soirée étudiante</a>             </li>
                    <li class="lieu"><a href="/club/paris/10/duplex">Duplex</a></li>                <li class="musique">house, electro, r&b chic, latino, disco</li>
                    <li class="pass-label">pass</li>                </ul>
                      <a href="/club/paris/10/duplex" title="duplex"><img src="/img/salles/resize/50/10.jpg" alt="duplex" class="flysalle"></a>
                 <hr class="clearleft">
        </div>

I tested something like this:

    $PATTERN = "/\<div id="tab-soiree".*<a href="/event/(.*)/">(.*)</a>/"
    preg_match($PATTERN, $html, $matches);

but it doesn't work.

You can't use regexes to parse HTML, so use a DOM parser instead :)
– Daan, Commented Apr 30, 2012 at 15:23

Sampson · Accepted Answer · 2012-04-30 15:31:22Z

2

You don't parse HTML with Regular Expressions. Instead, use the built-in DOM parsing tools within PHP itself: http://php.net/manual/en/book.dom.php

Assuming your HTML is accessible from a variable named $html:

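    // Collect libxml warnings internally; real-world HTML is rarely perfectly valid.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // The first <li> in the sample is <li class="nom">; its first <a> is the event link.
    $item = $doc->getElementsByTagName("li")->item(0);
    $link = $item->getElementsByTagName("a")->item(0);

    echo $link->getAttribute('href'), "\n"; // /event/83318/
    echo $link->textContent, "\n";          // FIESTA ERASMUS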
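
The lookup above depends on the event link sitting inside the first <li> of the page. A more targeted variant could use DOMXPath to select the link by its surrounding markup. This is only a sketch: it assumes the id tab-soiree and the class nom stay stable, and the trimmed $html sample and the names $href and $title are illustrative rather than part of the original answer.

```php
<?php
// Sample markup trimmed from the question; a real page would be fetched first.
$html = <<<'HTML'
<div id="tab-soiree" class=""><div class="soireeagenda cat_1">
  <a href="/event/83318/" class="lienFly"><img src="http://www.parisbouge.com/img/fly/resize/100/83318.jpg" alt="fiesta erasmus" class="fly"></a>
  <ul>
    <li class="nom"><h2><a href="/event/83318/">FIESTA ERASMUS</a> </h2></li>
    <li class="lieu"><a href="/club/paris/10/duplex">Duplex</a></li>
  </ul>
</div></div>
HTML;

// Collect parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select the link inside <li class="nom"> within the tab-soiree container.
$links = $xpath->query('//div[@id="tab-soiree"]//li[@class="nom"]//a');

$href  = null;
$title = null;
if ($links !== false && $links->length > 0) {
    $link  = $links->item(0);
    $href  = rtrim($link->getAttribute('href'), '/'); // drop the trailing slash
    $title = trim($link->textContent);
}

echo $href, "\n", $title, "\n";
```

Selecting by id and class keeps the extraction working if other list items are added before the event name, at the cost of breaking if the site renames those attributes.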

edited Apr 30, 2012 at 15:31

answered Apr 30, 2012 at 15:24

Sampson


1 Comment

geekInside Over a year ago

Thank you. I will try with the DOMDocument

ᴍᴇʜᴏᴠ · Accepted Answer · 2012-04-30 15:27:29Z

1

I suggest the following pattern:

    $PATTERN = '%<h2><a href="(.*?)">(.*?)</a>[\s]+</h2>%i';
    preg_match($PATTERN, $html, $matches);

The (.*?) part is a non-greedy (lazy) group: the regex engine matches as few characters as possible, so instead of running on to the last possible match in the supplied string it stops at the first closing " in this case.

You may also want to pre-process the HTML before applying the regex, e.g. remove all line breaks, so that the [\s]+ part is no longer needed.
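
The difference between greedy and non-greedy matching can be seen on a shortened version of the question's markup. This is a minimal sketch: the two-item $html string and the variable names $greedy and $lazy are assumptions made for illustration.

```php
<?php
// Two list items from the question, joined into one line.
$html = '<li class="nom"><h2><a href="/event/83318/">FIESTA ERASMUS</a> </h2></li>'
      . '<li class="lieu"><a href="/club/paris/10/duplex">Duplex</a></li>';

// Greedy: (.*) extends to the LAST '">' in the string, swallowing both links.
preg_match('%<a href="(.*)">%', $html, $greedy);

// Non-greedy: (.*?) stops at the FIRST '">' after the opening quote.
preg_match('%<a href="(.*?)">%', $html, $lazy);

// The full pattern from the answer: group 1 is the href, group 2 the link text.
$found = preg_match('%<h2><a href="(.*?)">(.*?)</a>[\s]+</h2>%i', $html, $matches);

echo $greedy[1], "\n";
echo $lazy[1], "\n";
if ($found === 1) {
    echo $matches[1], "\n", $matches[2], "\n";
}
```

The greedy capture runs through to the second link's URL, while the lazy capture contains only /event/83318/, which is why the suggested pattern uses (.*?) for both groups.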

You can try it online here.

answered Apr 30, 2012 at 15:27

ᴍᴇʜᴏᴠ


1 Comment

geekInside Over a year ago

What do you advise: DOM or regex?
