I am trying to create a basic web crawler that specifically looks for links from adverts.

I have managed to find a script that uses cURL to get the contents of the target webpage.

I also found one that uses the DOM. This is the cURL one:

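<?php
    // Fetch the raw HTML of the target page and write it to a file.
    $ch = curl_init("http://www.nbcnews.com");
    $fp = fopen("source_code.txt", "w");

    curl_setopt($ch, CURLOPT_FILE, $fp);   // stream the response body into $fp
    curl_setopt($ch, CURLOPT_HEADER, 0);   // leave the HTTP headers out of the output

    curl_exec($ch);
    curl_close($ch);
    fclose($fp);
?>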
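For reference, the DOM side might look roughly like the sketch below. It assumes PHP's bundled DOM extension (`DOMDocument`); the helper name `extractLinks` and the sample markup are made up for illustration and are not the script I found.

```php
<?php
// Hypothetical helper: collect the href of every <a> tag found in static HTML.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world pages are rarely well-formed, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Illustrative input: one static link, plus a script that would add a second
// link only if a browser actually executed it.
$sample = '<html><body>'
        . '<a href="http://example.com/static">static</a>'
        . '<script>var a = document.createElement("a");'
        . 'a.href = "http://example.com/ad";'
        . 'document.body.appendChild(a);</script>'
        . '</body></html>';

print_r(extractLinks($sample));
```

Only the static link comes back here, because the parser treats the contents of the `<script>` element as plain text and never runs it.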

These are great, and I certainly feel like I'm heading in the right direction. However, quite a few adverts are inserted by JavaScript, and because that runs client-side, cURL never executes it: I only see the JS code and not the ads.

Basically, is there any way of getting the JS to execute before I start trying to extract the links?

Thanks

  • 5
    I don't know if there are JavaScript engines written in PHP, but you can achieve what you want using PhantomJS, which is a headless, programmable browser. Commented Sep 4, 2013 at 20:21
  • @KemalDağ You should add that as an answer. Short but correct. Commented Sep 4, 2013 at 21:21
  • Is there any way of rendering the JavaScript server-side, in PHP or otherwise? Commented Sep 4, 2013 at 21:24
  • 2
    Crawling ad links sounds like using an automated service to increase your revenue, which would be against the terms of service of all ad services. In other words, you should not be doing it. They put it in JavaScript for a reason: to prevent you from doing it. Commented Sep 4, 2013 at 21:31
  • @developerwjk, I'm unsure what led you to that assumption, but you are very wrong. You'll also find that ad networks will filter out link hits from crawlers. I'm more interested in finding out who is advertising on certain sites and for how long. Commented Sep 4, 2013 at 21:45
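
The PhantomJS suggestion in the comments could be driven from PHP roughly as follows. This is only a sketch under assumptions: it presumes a `phantomjs` binary is on the PATH, the helper names `buildPhantomCommand` and `fetchRenderedHtml` are made up, and the 2-second wait is an arbitrary guess at how long the ad scripts need. The embedded script uses PhantomJS's `webpage` module to load the page, let its JavaScript run, and print the resulting markup.

```php
<?php
// Hypothetical helper: build the shell command that runs a PhantomJS script
// against a URL, escaping both arguments.
function buildPhantomCommand(string $scriptPath, string $url, string $binary = 'phantomjs'): string
{
    return escapeshellcmd($binary) . ' '
         . escapeshellarg($scriptPath) . ' '
         . escapeshellarg($url);
}

// PhantomJS script: open the URL passed as the first argument, wait, then
// print the DOM as it stands after the page's own JavaScript has run.
$phantomScript = <<<'JS'
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + system.args[1]);
        phantom.exit(1);
        return;
    }
    // Assumed delay so that ad scripts have time to inject their markup.
    setTimeout(function () {
        console.log(page.content);
        phantom.exit(0);
    }, 2000);
});
JS;

// Hypothetical helper: write the script to a temporary file, run it, and
// return whatever PhantomJS printed (null if nothing could be captured).
function fetchRenderedHtml(string $url, string $phantomScript): ?string
{
    $scriptPath = sys_get_temp_dir() . DIRECTORY_SEPARATOR . uniqid('render_') . '.js';
    file_put_contents($scriptPath, $phantomScript);

    $output = shell_exec(buildPhantomCommand($scriptPath, $url));
    unlink($scriptPath);

    return is_string($output) ? $output : null;
}

// Example call (requires PhantomJS to be installed):
// $html = fetchRenderedHtml('http://www.nbcnews.com', $phantomScript);
```

The returned HTML can then be handed to whatever DOM-based link extraction is already in place, since by that point the ad markup is part of the document. PhantomJS development has since been suspended, but the same approach applies to any other headless browser that can dump the rendered page.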
