I am indexing web pages. My code scans a given page for its links and for the page's title, and stores the links and the title in two separate arrays. I would like to build a multidimensional array that pairs each link with the title of the page that link points to. I have the code for both parts; I just don't know how to put them together.

require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');

// links
$links = array();
$URL = 'http://www.youtube.com'; // change this to the URL you want to scan

// grab the URLs from $URL
$file = file_get_html($URL);
foreach ($file->find('a') as $theelement) {
    $links[] = url_to_absolute($URL, $theelement->href);
}
print_r($links);

// titles
$titles = array();
$str = file_get_contents($URL);
// preg_match_all() returns the number of matches; the captured text ends up in $title[1]
if (preg_match_all('/<title>(.*?)<\/title>/is', $str, $title)) {
    $titles = $title[1];
}

print_r($titles);
  • Can you give an example of what you'd expect this to output? Commented Sep 16, 2012 at 13:51
  • What does the HTML you are scraping look like? Your methodology seems flawed: you use a DOM parser to retrieve the <a> tags, then a separate regex to retrieve the <title>. Also, please post an example of what your output should look like. Commented Sep 16, 2012 at 13:52
  • Yes, please post an example of what you want as output. Sincerely, your current description is incomprehensible. Commented Sep 16, 2012 at 14:01
  • An example of what I would like is: Array => google.com => Google Commented Sep 16, 2012 at 15:59

2 Answers

1

You should be able to do this. Assuming there are as many titles as there are links, each title sits at the same array key as its link:

$newArray = array();

foreach ($links as $key => $val)
{
    $newArray[$key]['link'] = $val;
    $newArray[$key]['title'] = $titles[$key];
}
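
The loop above relies on $titles having an entry for every key in $links. As a minimal sketch of the same pairing with a guard for missing titles, the snippet below uses made-up sample data and a hypothetical helper, pairLinksWithTitles(), neither of which comes from the question:

```php
<?php
// Hypothetical sample data standing in for the scraped results.
$links  = array('http://www.google.com', 'http://www.youtube.com');
$titles = array('Google'); // deliberately one entry short

// Hypothetical helper: pairs each link with the title at the same key,
// using null when no title exists for that key.
function pairLinksWithTitles(array $links, array $titles)
{
    $pairs = array();
    foreach ($links as $key => $link) {
        $pairs[$key] = array(
            'link'  => $link,
            'title' => isset($titles[$key]) ? $titles[$key] : null,
        );
    }
    return $pairs;
}

$pairs = pairLinksWithTitles($links, $titles);
print_r($pairs);
```

With the guard in place, a link without a matching title gets a null title instead of triggering an undefined-offset notice.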

1 Comment

There are no titles to display in the script above. It creates exactly the structure I want, except that it does not scan each URL for its title and fill in the title value.
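
One possible way to fill in the missing titles is to fetch every link and read its own <title>. The sketch below is an assumption, not part of either answer: it swaps in PHP's built-in DOMDocument for simple_html_dom, and extractTitle(), titlesForLinks(), the example.com URLs and the stub fetcher are all hypothetical names used for illustration.

```php
<?php
// Hypothetical helper: returns the text of the first <title> element, or null.
function extractTitle($html)
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally so malformed HTML does not print notices.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = $doc->getElementsByTagName('title');
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

// Hypothetical helper: maps each distinct link to its page title.
// $fetch is any callable that takes a URL and returns HTML, or false on failure.
function titlesForLinks(array $links, $fetch = 'file_get_contents')
{
    $result = array();
    foreach (array_unique($links) as $link) {
        $html = @call_user_func($fetch, $link);
        $result[$link] = ($html === false || $html === '') ? null : extractTitle($html);
    }
    return $result;
}

// Stub fetcher with canned pages, so the sketch runs without network access.
$pages = array(
    'http://example.com/a' => '<html><head><title>Page A</title></head><body></body></html>',
);
$fakeFetch = function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : false;
};

$result = titlesForLinks(array('http://example.com/a', 'http://example.com/missing'), $fakeFetch);
print_r($result);
```

Calling titlesForLinks($links) without a second argument would fetch each link over the network, which can be slow on a page with many anchors; the result has the link => title shape described in the question's comments.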
0

It is not clear what you want.

Anyway, here is how I would rewrite your code in a more organized way:

require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');

$info = array();

$urls = array(
    'http://www.youtube.com',
    'http://www.google.com.br'
);

foreach ($urls as $url)
{
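    $str = file_get_contents($url);
    $html = ($str === false) ? false : str_get_html($str);
    if (!$html)
    {
        continue; // skip pages that could not be fetched or parsed
    }

    // find('title', 0) returns the first <title> element, or null if there is none
    $titleNode = $html->find('title', 0);
    $title = $titleNode ? strval($titleNode->plaintext) : '';

    $links = array();
    foreach ($html->find('a') as $anchor)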
    {
        $links[] = url_to_absolute($url, strval($anchor->href));
    }
    $links = array_unique($links);

    $info[$url] = array(
        'title' => $title,
        'links' => $links
    );
}

print_r($info);
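
For readers without the two third-party libraries, the per-page step (one title plus a de-duplicated list of links) can be sketched with PHP's built-in DOMDocument. This is an assumption rather than the code above: summarizePage() and the sample markup are hypothetical, and the hrefs are returned as written, without the absolute-URL resolution that url_to_absolute() performs.

```php
<?php
// Hypothetical helper: parses one HTML string and returns its title
// and its distinct href values in document order.
function summarizePage($html)
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally so malformed HTML does not print notices.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNodes = $doc->getElementsByTagName('title');
    $title = $titleNodes->length > 0 ? trim($titleNodes->item(0)->textContent) : null;

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return array(
        'title' => $title,
        // array_values() re-indexes the keys after array_unique() drops duplicates.
        'links' => array_values(array_unique($links)),
    );
}

// Made-up sample page containing one duplicate link.
$sample = '<html><head><title>Demo</title></head><body>'
        . '<a href="/a">A</a><a href="/a">A again</a><a href="/b">B</a></body></html>';

$summary = summarizePage($sample);
print_r($summary);
```

The returned array has the same 'title' and 'links' keys as each $info[$url] entry, so it could stand in for the body of the loop once the HTML has been fetched.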
