I need to extract the words FIESTA ERASMUS and the path /event/83318 from the following HTML code:

    <div id="tab-soiree" class=""><div class="soireeagenda cat_1">  
            <a href="/event/83318/" class="lienFly"><img src="http://www.parisbouge.com/img/fly/resize/100/83318.jpg" alt="fiesta erasmus" class="fly"></a>
                <ul>
                    <li class="nom"><h2><a href="/event/83318/">FIESTA ERASMUS</a> </h2></li>
                    <li class="genre" style="margin-bottom:4px;">
                    <a href="/soirees-etudiantes/paris/1/" style="color:inherit;" title="soirée étudiante">soirée étudiante</a>             </li>
                    <li class="lieu"><a href="/club/paris/10/duplex">Duplex</a></li>                <li class="musique">house, electro, r&b chic, latino, disco</li>
                    <li class="pass-label">pass</li>                </ul>
                      <a href="/club/paris/10/duplex" title="duplex"><img src="/img/salles/resize/50/10.jpg" alt="duplex" class="flysalle"></a>
                 <hr class="clearleft">
        </div>

I tested something like this

$PATTERN = "/\<div id="tab-soiree".*<a href="/event/(.*)/">(.*)</a>/"
preg_match($PATTERN, $html, $matches);

but it doesn't work.

2 Answers

2

You don't parse HTML with Regular Expressions. Instead, use the built-in DOM parsing tools within PHP itself: http://php.net/manual/en/book.dom.php

Assuming your HTML is accessible from a variable named $html:

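// Collect libxml warnings about imperfect markup instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// In the snippet above, the first <li> is the "nom" item and its first <a> is the event link.
$item = $doc->getElementsByTagName("li")->item(0);
$link = $item->getElementsByTagName("a")->item(0);

echo $link->getAttribute('href'), PHP_EOL; // /event/83318/
echo $link->textContent, PHP_EOL;          // FIESTA ERASMUS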
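
Picking the first li by position breaks as soon as the page puts another list ahead of it. A minimal sketch of a more targeted lookup with DOMXPath follows; the sample markup is a trimmed copy of the HTML from the question, and the variable names ($xpath, $href, $title) are only illustrative:

```php
<?php
// Hypothetical sample input: a trimmed copy of the markup from the question.
$html = <<<'HTML'
<div id="tab-soiree"><div class="soireeagenda cat_1">
  <a href="/event/83318/" class="lienFly"><img src="http://www.parisbouge.com/img/fly/resize/100/83318.jpg" alt="fiesta erasmus" class="fly"></a>
  <ul>
    <li class="nom"><h2><a href="/event/83318/">FIESTA ERASMUS</a> </h2></li>
    <li class="lieu"><a href="/club/paris/10/duplex">Duplex</a></li>
  </ul>
</div></div>
HTML;

// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// First <a> inside the <li class="nom"> under the #tab-soiree container.
$link = $xpath->query('//div[@id="tab-soiree"]//li[@class="nom"]//a')->item(0);

$href  = null;
$title = null;
if ($link instanceof DOMElement) {
    $href  = rtrim($link->getAttribute('href'), '/'); // drop the trailing slash
    $title = trim($link->textContent);
    echo $href, PHP_EOL;  // /event/83318
    echo $title, PHP_EOL; // FIESTA ERASMUS
}
```

The exact-match test @class="nom" only works while the li carries a single class; if the site adds more classes, the expression would need contains() instead.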

1 Comment

Thank you. I will try with the DOMDocument
1

I suggest the following pattern:

$PATTERN = '%<h2><a href="(.*?)">(.*?)</a>[\s]+</h2>%i';
preg_match($PATTERN, $html, $matches);

The (.*?) part is a non-greedy (lazy) group: the regex engine matches as few characters as possible, so it stops at the first "> it finds instead of running on to the last one in the supplied string.
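
To make the difference concrete, here is a small sketch that runs a greedy and a non-greedy group over the same line, then applies the suggested pattern to the h2 line from the question. The $line and $h2 strings are hypothetical inputs built from the question's markup:

```php
<?php
// Hypothetical input: two links on one line, so greedy and lazy matching differ.
$line = '<a href="/event/83318/">FIESTA ERASMUS</a> <a href="/club/paris/10/duplex">Duplex</a>';

// Greedy: (.*) runs to the end of the line, then backtracks to the LAST '">'.
preg_match('%<a href="(.*)">%', $line, $greedy);

// Non-greedy: (.*?) grows one character at a time and stops at the FIRST '">'.
preg_match('%<a href="(.*?)">%', $line, $lazy);

echo $greedy[1], PHP_EOL; // /event/83318/">FIESTA ERASMUS</a> <a href="/club/paris/10/duplex
echo $lazy[1], PHP_EOL;   // /event/83318/

// The suggested pattern applied to the <h2> line from the question.
$h2 = '<li class="nom"><h2><a href="/event/83318/">FIESTA ERASMUS</a> </h2></li>';
preg_match('%<h2><a href="(.*?)">(.*?)</a>[\s]+</h2>%i', $h2, $matches);

echo $matches[1], PHP_EOL; // /event/83318/
echo $matches[2], PHP_EOL; // FIESTA ERASMUS
```

$matches[0] holds the whole matched text; the two captured groups are at indexes 1 and 2.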

You may also want to pre-process the HTML before applying the regex, e.g. remove all line breaks, in order to get rid of the [\s]+ part.

You can try it online here.

1 Comment

What do you advise: DOM or regex?
