PHP web scraping

Question

I'm scraping a web page with PHP, and I want to get the Sunday price (3.65) from the HTML code below:

     <tr class="odd">
       <td >
           <b>Sunday</b> Info
           <div class="test">test</div>
       </td>
       <td>
       &euro; 3.65 *

       </td>
    </tr>

But I can't find the right regex to do this. I use this PHP code:

    <?php
        $data = file_get_contents('http://www.test.com/');

        preg_match('/<tr class="odd"><td ><b>Sunday</b> Info<div class="test">test<\/div><\/td><td>&euro; (.*) *<\/td><\/tr>/i', $data, $matches);
        $result = $matches[1];
    ?>

But I get no result. What's wrong with the regex? (I think it's because of the newlines/spaces?)

Use a regex on "€ ([0-9.]*) " instead to get the price. If it's among others, you could split() it first. Watch out for special regex characters too, like the obvious * after the price!
– Waygood, Commented Aug 6, 2012 at 11:55
But I also need to use the "Sunday", because there are other days as well...
– francisMi, Commented Aug 6, 2012 at 11:58
/Sunday(.*)€ ([0-9.]*)/s will give me the longest possible match. Is there a way to get the shortest match? If that's possible, that could work...
– francisMi, Commented Aug 6, 2012 at 12:28
If you don't have permission to scrape from the site, then don't do it. If you do have permission, then ask for a pricelist feed in XML, which will be designed for data extraction.
– Bobulous, Commented Aug 6, 2012 at 18:29
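
On the shortest-match question in the comments: a lazy quantifier (`.*?`) is the usual way to get it. The sketch below assumes the raw HTML contains the entity `&euro;` as in the question's sample, and it adds an invented Monday row (not from the original page) purely to show the difference between lazy and greedy matching.

```php
<?php
// Hypothetical sample: the Sunday row from the question plus an invented Monday row.
$html = <<<'HTML'
<tr class="odd">
  <td >
      <b>Sunday</b> Info
      <div class="test">test</div>
  </td>
  <td>
  &euro; 3.65 *

  </td>
</tr>
<tr class="even">
  <td >
      <b>Monday</b> Info
  </td>
  <td>
  &euro; 4.10 *
  </td>
</tr>
HTML;

// .*? is lazy: it stops at the first "&euro;" that follows "Sunday".
// The s flag lets . match newlines as well.
$price = null;
if (preg_match('/Sunday.*?&euro;\s*([0-9.]+)/s', $html, $matches)) {
    $price = $matches[1];
}

// Greedy .* runs on to the last "&euro;", so it picks up the Monday price instead.
$greedyPrice = null;
if (preg_match('/Sunday.*&euro;\s*([0-9.]+)/s', $html, $greedy)) {
    $greedyPrice = $greedy[1];
}
```

With this sample, `$price` holds the Sunday value and `$greedyPrice` the Monday one, which is the "longest possible" behaviour described in the comment.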

Martin · Accepted Answer · 2012-08-06 11:30:59Z

6

Don't use regular expressions; HTML is not a regular language.

Instead, use a DOM tree parser like DOMDocument. This documentation may help you.

The /s switch should help you with your original regex, though I haven't tried it.
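
Note that the s flag only changes what `.` matches, so the literal gaps between the tags in the original pattern still need to allow whitespace. A minimal sketch, assuming the markup looks exactly like the sample in the question: it puts `\s*` between the tags, escapes the literal `*` after the price, and uses `~` as the delimiter so the slashes in closing tags need no escaping.

```php
<?php
// Sample row copied from the question.
$data = <<<'HTML'
     <tr class="odd">
       <td >
           <b>Sunday</b> Info
           <div class="test">test</div>
       </td>
       <td>
       &euro; 3.65 *

       </td>
    </tr>
HTML;

// \s* tolerates any run of spaces, tabs and line breaks between the tags.
// \* matches the literal asterisk that follows the price.
$pattern = '~<tr\s+class="odd">\s*<td\s*>\s*<b>Sunday</b>\s*Info\s*'
         . '<div\s+class="test">test</div>\s*</td>\s*<td>\s*&euro;\s*'
         . '([0-9.]+)\s*\*\s*</td>\s*</tr>~i';

$result = null;
if (preg_match($pattern, $data, $matches)) {
    $result = $matches[1];
}
```

A pattern this tightly bound to the markup breaks as soon as the page layout changes, which is one reason a DOM parser tends to be the more robust choice.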


coding Bott · Accepted Answer · 2012-08-06 11:31:29Z

3

The problem is the whitespace between the tags: there are line breaks, tabs and/or spaces.

Your regex doesn't account for them.

You also need to set up your preg_match so that the pattern can match across multiple lines.

I think it is easier to use XPath for scraping.
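
A rough sketch of the XPath route, using PHP's DOMDocument and DOMXPath. The surrounding `<table>` element, the variable names, and the final number-extracting regex are assumptions added for illustration; the real page may need a different query.

```php
<?php
// Sample row from the question, wrapped in a <table> (an assumption about the page).
$html = <<<'HTML'
<table>
  <tr class="odd">
    <td >
        <b>Sunday</b> Info
        <div class="test">test</div>
    </td>
    <td>
    &euro; 3.65 *

    </td>
  </tr>
</table>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Second td of the row that has a cell containing <b>Sunday</b>.
$cells = $xpath->query('//tr[td/b[normalize-space()="Sunday"]]/td[2]');

$price = null;
if ($cells !== false && $cells->length > 0) {
    // The cell text still carries the currency sign and the "*"; keep only the number.
    if (preg_match('/[0-9]+(?:\.[0-9]+)?/', $cells->item(0)->textContent, $m)) {
        $price = $m[0];
    }
}
```

Because the parser has already discarded the layout whitespace between tags, the query depends only on the document structure, not on how the HTML source is indented.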


matteomattei · Accepted Answer · 2012-08-06 11:33:36Z

2

Try replacing the newlines with '' (an empty string) and then run the regex again.
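
A small sketch of that idea, assuming the sample markup from the question. Removing the line breaks leaves the indentation spaces in place, so the pattern here is deliberately looser than the original one (this loosening is an assumption, not part of the suggestion above).

```php
<?php
// Sample row copied from the question.
$data = <<<'HTML'
     <tr class="odd">
       <td >
           <b>Sunday</b> Info
           <div class="test">test</div>
       </td>
       <td>
       &euro; 3.65 *

       </td>
    </tr>
HTML;

// Strip carriage returns and line feeds so the markup sits on one line.
$flat = str_replace(["\r", "\n"], '', $data);

// With no newlines left, . can cross the whole row without the s flag.
// " *" absorbs the spaces that remain between "&euro;" and the number.
$price = null;
if (preg_match('/<b>Sunday<\/b>.*?&euro; *([0-9.]+)/', $flat, $matches)) {
    $price = $matches[1];
}
```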


Stefano · Accepted Answer · 2017-03-23 10:44:51Z

1

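Try it this way:

    $uri = 'http://www.test.com/';
    $get = file_get_contents($uri);

    // Searching for the whole row as one string fails because of the whitespace
    // between tags, so look for short markers instead.
    $pos1 = strpos($get, '<b>Sunday</b>');                          // start of the Sunday row
    $pos2 = strpos($get, '&euro;', $pos1) + strlen('&euro;');       // just past the currency sign
    $pos3 = strpos($get, '*', $pos2);                               // asterisk after the price
    $text = substr($get, $pos2, $pos3 - $pos2);
    $text1 = trim(strip_tags($text));                               // "3.65" for the sample row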


user7248763 · Accepted Answer · 2017-09-15 04:22:53Z

You can use PHP's DOMDocument object to parse the HTML from the web page into a DOM:

    $dom = new DOMDocument();
    // $data is the HTML fetched in the question; @ silences warnings about malformed markup
    @$dom->loadHTML($data);

    $trs = $dom->getElementsByTagName('tr'); // all the tr elements on the web page

    // loop through all the tr elements
    foreach ($trs as $tr) {
        $b = $tr->getElementsByTagName('b')->item(0); // null if the row has no b element
        // until we find one with the class 'odd' whose b element reads 'Sunday'
        if ($b !== null && $tr->getAttribute('class') == 'odd' && $b->nodeValue == 'Sunday') {
            // the text of the second td holds the price, still with the currency sign and the trailing *
            $price = trim($tr->getElementsByTagName('td')->item(1)->nodeValue);
            break;
        }
    }

DOMDocument is a bit tedious for web scraping; as an alternative, you can use SimpleHtmlDomParser, which is open source.
