3

I'm scraping a web page with PHP, and I want to get the Sunday price (3.65) from the HTML below:

     <tr class="odd">
       <td >
           <b>Sunday</b> Info
           <div class="test">test</div>
       </td>
       <td>
       &euro; 3.65 *

       </td>
    </tr>

But I can't find a regex that does this. I'm using this PHP code:

    <?php
        $data = file_get_contents('http://www.test.com/');

        preg_match('/<tr class="odd"><td ><b>Sunday</b> Info<div class="test">test<\/div><\/td><td>&euro; (.*) *<\/td><\/tr>/i', $data, $matches);
        $result = $matches[1];
    ?>

But I get no result. What's wrong with the regex? (I think it's because of the newlines and spaces?)

4
  • regex on "&euro; ([0-9.]*) " instead to get the price. If it's among others, you could split() it first. Watch out for special regex characters too, like the obvious * after the price! Commented Aug 6, 2012 at 11:55
  • But I also need to use the "Sunday", because there are also other days... Commented Aug 6, 2012 at 11:58
  • /Sunday(.*)&euro; ([0-9.]*)/s will give me the longest possible answer, is there a way to get the shortest answer? If that's possible, that could work... Commented Aug 6, 2012 at 12:28
  • If you don't have permission to scrape from the site, then don't do it. If you do have permission, then ask for a pricelist feed in XML, which will be designed for data extraction. Commented Aug 6, 2012 at 18:29

5 Answers

6

Don't use regular expressions; HTML is not a regular language.

Instead, use a DOM Tree parser like DOMDocument. This documentation may help you.

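The /s modifier, which lets . match newlines, should help with your original regex, though I haven't tried it, and the pattern would still need to allow for the whitespace between the tags.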


3

The problem is the whitespace between the tags: there are line breaks, tabs and/or spaces.

Your regex doesn't match them.

You also need to set up your preg_match to work across multiple lines (the /s modifier lets . match newlines)!
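The whitespace problem described above can be handled in the pattern itself. The following is only a sketch, assuming the row looks exactly like the sample in the question; the `$html` string and the `$pattern` and `$price` names are placeholders for illustration, and a real page would be loaded with `file_get_contents()`. It puts `\s*` wherever whitespace may appear between tags, escapes the literal `*` after the price, and uses a lazy `.*?` so the match stops at the first closing `</td>`.

```php
<?php
// Sample row copied from the question (assumption: the live page uses the same markup).
$html = <<<'HTML'
<tr class="odd">
  <td >
      <b>Sunday</b> Info
      <div class="test">test</div>
  </td>
  <td>
  &euro; 3.65 *

  </td>
</tr>
HTML;

// \s*  absorbs the newlines, tabs and spaces between tags.
// .*?  is lazy, so with /s it skips " Info <div>...</div>" up to the first </td>.
// \*   matches the literal asterisk after the price.
$pattern = '/<tr class="odd">\s*<td\s*>\s*<b>Sunday<\/b>.*?<\/td>'
         . '\s*<td>\s*&euro;\s*([0-9.]+)\s*\*\s*<\/td>\s*<\/tr>/is';

// preg_match returns 1 on a match; the price is the first capture group.
$price = preg_match($pattern, $html, $matches) ? $matches[1] : null;

var_dump($price);
```

For the sample row, `$price` ends up as the string `3.65`. A pattern this strict breaks as soon as the site changes its markup, which is why a parser-based approach is often easier to maintain.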

I think it is easier to use XPath for scraping.
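As a rough illustration of the XPath approach, here is a minimal sketch using PHP's built-in `DOMDocument` and `DOMXPath` classes. The `$html` string (including the extra Monday row and its price) is made up for the example, and the XPath expression assumes the day name is the only text inside the `<b>` tag.

```php
<?php
// Hypothetical table: the Sunday row is from the question, the Monday row is invented.
$html = <<<'HTML'
<table>
  <tr class="even">
    <td><b>Monday</b> Info</td>
    <td>&euro; 4.10 *</td>
  </tr>
  <tr class="odd">
    <td >
        <b>Sunday</b> Info
        <div class="test">test</div>
    </td>
    <td>
    &euro; 3.65 *

    </td>
  </tr>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select the second td of the row whose b element reads "Sunday".
$cell = $xpath->query('//tr[td/b[normalize-space()="Sunday"]]/td[2]')->item(0);

// The cell text contains the currency sign and an asterisk; keep only the number.
$price = null;
if ($cell !== null && preg_match('/[0-9]+(?:\.[0-9]+)?/', $cell->textContent, $m)) {
    $price = $m[0];
}

var_dump($price);
```

Because the parser ignores the whitespace between tags, the line breaks and indentation that defeat the original regex have no effect on the query.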


2

Try replacing the newlines with '' and then running the regexp again.
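A minimal sketch of that idea, assuming the markup from the question (the `$html`, `$flat` and `$price` names are placeholders). Removing the newlines alone leaves the indentation in place, so the pattern here is also shortened to anchor only on `Sunday` and `&euro;` rather than on the full tag sequence.

```php
<?php
// Sample row copied from the question.
$html = <<<'HTML'
<tr class="odd">
  <td >
      <b>Sunday</b> Info
      <div class="test">test</div>
  </td>
  <td>
  &euro; 3.65 *

  </td>
</tr>
HTML;

// Strip line breaks so that "." can run across what used to be several lines.
$flat = str_replace(["\r", "\n"], '', $html);

// Lazy .*? stops at the first &euro; after "Sunday"; \s* skips the space before the number.
$price = preg_match('/<b>Sunday<\/b>.*?&euro;\s*([0-9.]+)/', $flat, $m) ? $m[1] : null;

var_dump($price);
```

One caveat: if the Sunday row ever lacks a price, the lazy match would run on to the next `&euro;` on the page, so this is best treated as a quick fix.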


1

Try it this way:

    $uri = 'http://www.test.com/';
    $get = file_get_contents($uri);

    // Anchor on short markers: the full tag sequence never matches
    // because of the whitespace between the tags.
    $pos1 = strpos($get, '<b>Sunday</b>');
    $pos2 = strpos($get, '&euro;', $pos1) + strlen('&euro;');
    $pos3 = strpos($get, '*', $pos2);
    $price = trim(substr($get, $pos2, $pos3 - $pos2)); // "3.65" for the sample row


0

Use PHP's DOMDocument class to parse the HTML of the web page into a DOM tree:

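    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep parser warnings quiet
    $dom->loadHTML($data);            // $data is the page source from file_get_contents()

    $trs = $dom->getElementsByTagName('tr'); // this gives us all the tr elements on the webpage

    $price = null;
    // loop through all the tr tags
    foreach ($trs as $tr) {
        $b = $tr->getElementsByTagName('b')->item(0); // null if the row has no b tag
        // until we get one with the class 'odd' and a b tag value of 'Sunday'
        if ($tr->getAttribute('class') == 'odd' && $b !== null && trim($b->nodeValue) == 'Sunday') {
            // the second td holds text like "€ 3.65 *"; keep only the number
            if (preg_match('/[0-9]+(?:\.[0-9]+)?/', $tr->getElementsByTagName('td')->item(1)->nodeValue, $m)) {
                $price = $m[0];
            }
            break;
        }
    }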

If DOMDocument feels a bit tedious for web scraping, you can use SimpleHtmlDomParser instead; it's open source.

