I've been scratching my head for days over this stupid one.

I have an array of urls called $url_array pulled from the database like so -

Array (
    [id] => 2
    [url] => http://example.com
)

I have a foreach loop which runs over $url_array and scrapes each url for data like so -

foreach ($url_array as $row) {
    $data = $this->scrapePage($row["url"]);
    print_r($data);
    return false;
}

Currently $data is outputting nothing. But if I replace $row["url"] with http://example.com, the scrape happens correctly.

This is also the first time I've hosted this script on DigitalOcean, so I'm not sure whether some server technicality could be stopping the foreach loop from working.

edit: Here is the scrapePage function -

private function scrapePage($url) {
    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_COOKIESESSION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Charset: utf-8'));
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_VERBOSE, true);

    $content = curl_exec($ch);
    $header = curl_getinfo($ch);
    curl_close($ch);

    return array("header" => $header, "content" => $content);
}

Like I said, if I manually enter a url in there, it works fine, just not when in a loop.

As for the $url_array, this is the output when I print it out -

Array
(
    [0] => Array
        (
            [id] => 41
            [url] => http://www.example1.com
        )

    [1] => Array
        (
            [id] => 85
            [url] => http://test-url-2.com
        )
)

I've also tried a for loop over the data. If I modify the scrapePage function to return the $url, it returns the $url correctly.

  • Can you please post your scrapePage function? Commented Oct 27, 2017 at 13:02
  • Is $url_array exactly as the array you posted above? Or is that just one subarray from a larger, multidimensional array that you are not showing? Commented Oct 27, 2017 at 13:03
  • Just debug into scrapePage or add a log statement to it, logging $url, and see what's really happening. Commented Oct 27, 2017 at 13:15
  • @TomRegner can you please elaborate on how to do that? All I've done is print the $row["url"] value going into the function, and from scrapePage, I return the $url straight away and I get the same url. But if I use $this->scrapePage("example.com"); it works fine. Commented Oct 27, 2017 at 13:19
  • If you manually create an array of 2-3 URLs and iterate over it, does it work? If yes, then maybe create it, iterate over both at once ($key => $row) and compare (===) the URLs from the file and db arrays. If no, then try adding sleep(2) in your loop. Commented Oct 27, 2017 at 13:49
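
To make the logging suggestion above concrete, here is a minimal sketch of one way to log a URL so that invisible characters become visible. The helper name describeUrl and the sample row are hypothetical and not part of the original code; printing a string with print_r will not show a trailing carriage return, whereas json_encode renders it as an escape sequence and strlen exposes the extra byte.

```php
<?php
// Hypothetical helper (not from the original post): makes invisible
// characters in a URL visible so they can be spotted in a log.
function describeUrl(string $url): string
{
    // json_encode renders control characters as escapes such as \r or \n.
    $visible = json_encode($url, JSON_UNESCAPED_SLASHES);

    // The byte length gives a second hint when it is longer than expected.
    return sprintf('%s (length %d)', $visible, strlen($url));
}

// Assumed sample row, shaped like one entry of $url_array.
$row = ['id' => 41, 'url' => "http://www.example1.com\r"];

// Writes the description to the configured PHP error log.
error_log(describeUrl($row['url']));
```

Calling the same helper on a hard-coded URL and on the value from the database, then comparing the two log lines, should show whether the strings really are identical.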

2 Answers

After much headache, I've found the issue. The database of urls I had looked like this -

http://www.example1.com\r
http://www.example2.com\r
http://www.example3.com\r
http://www.example4.com\r

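Note the "\r" (carriage return) at the end of each url; that was messing up cURL. Because the character is invisible when printed, the urls looked correct in the output. I had assumed the database I was given was clean. Apparently not! I removed all the trailing \r's and the code now works as expected.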
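
If cleaning the database itself is not an option, a defensive alternative is to trim each url before handing it to cURL. This is only a sketch under assumptions: the helper name cleanUrl and the sample rows are hypothetical, and the echo stands in for the call to scrapePage from the question.

```php
<?php
// Hypothetical helper (not from the original post): strips leading and
// trailing whitespace, which includes "\r" and "\n", from a URL.
function cleanUrl(string $url): string
{
    return trim($url);
}

// Assumed sample rows, shaped like $url_array but with the stray "\r".
$url_array = [
    ['id' => 41, 'url' => "http://www.example1.com\r"],
    ['id' => 85, 'url' => "http://test-url-2.com\r"],
];

foreach ($url_array as $row) {
    $url = cleanUrl($row['url']);

    // Skip anything that still does not look like a URL after trimming.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        continue;
    }

    // In the original class this would be: $data = $this->scrapePage($url);
    echo $url, PHP_EOL;
}
```

Trimming at the point of use keeps the loop working even if more unclean rows are added to the table later.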

Your $url_array is nested, so you should try the following to get the urls and pass them to your scrapePage function:

foreach ($url_array as $row) {
    // Each $row is itself an array with "id" and "url" keys.
    foreach ($row as $key => $value) {
        if ($key === 'url') {
            //$urls[] = $value;
            $data = $this->scrapePage($value);
            print_r($data);
        }
    }
}

1 Comment

I get the same result unfortunately. All content and header data
