
I am indexing web pages. The code scans a given web page for its links and for its title, and stores the links and the title in two separate arrays. I would like to create a multidimensional array that, when printed, shows the word Array, followed by the links, followed by the individual title of each link. I have the code; I just don't know how to put it together.

require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');

// links
$links = Array();
$URL = 'http://www.youtube.com'; // change it for urls to grab
// grabs the urls from URL
$file = file_get_html($URL);
foreach ($file->find('a') as $theelement) {
    $links[] = url_to_absolute($URL, $theelement->href);
}
print_r($links);

// titles
$titles = Array();
$str = file_get_contents($URL);
$titles[] = preg_match_all( "/\<title\>(.*)\<\/title\>/", $str, $title );

print_r($title[1]);
  • Can you give an example of what you'd expect this to output? Commented Sep 16, 2012 at 13:51
  • What does the HTML you are scraping look like? Using a DOM parser to retrieve the <a> tags and then, separately, a regex to retrieve the <title> seems like a flawed approach. Also, please post an example of what your output should look like. Commented Sep 16, 2012 at 13:52
  • Yes, please post an example of what you want as output. Sincerely, your current description is incomprehensible. Commented Sep 16, 2012 at 14:01
  • An example of what I would like is, say: Array => google.com => Google Commented Sep 16, 2012 at 15:59
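
As an illustration of the point raised in the comments about mixing a DOM parser with a regex, here is a minimal sketch that reads both the title and the links from a single parse. It uses PHP's built-in DOMDocument rather than the simple_html_dom library from the question, and the function name `extract_title_and_links` and the sample HTML are assumptions made for this example only.

```php
<?php
// Hypothetical helper (not from the original post): parse an HTML string once
// and return its <title> text together with the unique href values of its <a> tags.
function extract_title_and_links(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml warnings internally
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // First <title> element, or null if the page has none.
    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode !== null ? trim($titleNode->textContent) : null;

    // Collect every non-empty href.
    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return array(
        'title' => $title,
        'links' => array_values(array_unique($links)),
    );
}

// Made-up sample page; the title spans several lines on purpose.
$sample = "<html><head><title>\n  Example page\n</title></head>"
        . "<body><a href=\"/a\">A</a><a href=\"/b\">B</a><a href=\"/a\">A again</a></body></html>";

$result = extract_title_and_links($sample);
print_r($result);
```

Unlike the pattern `(.*)`, which does not match across line breaks by default, this approach also copes with a title that is split over several lines. The hrefs are returned as found in the page; resolving them against a base URL (for example with `url_to_absolute` from the question) would be a separate step.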

2 Answers


You should be able to do this, assuming there are as many titles as there are links, so that each link and its title share the same array key.

$newArray = array();

foreach ($links as $key => $val)
{
    $newArray[$key]['link'] = $val;
    $newArray[$key]['title'] = $titles[$key];
}

1 Comment

There are no titles to display in the script above. It creates exactly the structure I want, except that it does not scan the URLs for their titles and store them as the title values.
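
One possible way to fill in the missing titles, sketched under stated assumptions: request each link once, read its title, and key the result by link, which gives the `google.com => Google` shape asked for in the question. The names `fetch_title` and `build_link_title_map` are hypothetical, the title is read with PHP's built-in DOMDocument rather than simple_html_dom, and the demonstration uses a stand-in fetcher with made-up titles so that it runs without network access.

```php
<?php
// Hypothetical helper: download one URL and return its <title> text,
// or null if the request fails or the page has no title.
function fetch_title(string $url): ?string
{
    $html = @file_get_contents($url);
    if ($html === false || $html === '') {
        return null;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep markup warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName('title')->item(0);
    return $node !== null ? trim($node->textContent) : null;
}

// Hypothetical helper: build array(link => title). The fetcher is passed in
// so that the mapping logic can be tried without real HTTP requests.
function build_link_title_map(array $links, callable $fetchTitle): array
{
    $map = array();
    foreach (array_unique($links) as $link) {
        $map[$link] = $fetchTitle($link);
    }
    return $map;
}

// Offline demonstration with a stand-in fetcher and made-up titles.
$fakeTitles = array(
    'http://google.com'  => 'Google',
    'http://example.com' => 'Example Domain',
);
$stub = function (string $url) use ($fakeTitles): ?string {
    return $fakeTitles[$url] ?? null;
};

$map = build_link_title_map(
    array('http://google.com', 'http://example.com', 'http://google.com'),
    $stub
);
print_r($map);
```

With the `$links` array from the question, the real fetcher would be passed as `build_link_title_map($links, 'fetch_title')`. That issues one HTTP request per unique link, so it can be slow on pages with many links, and links that are not fetchable pages (such as `mailto:` addresses) simply get a null title.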

It is not clear what you want.

Anyway, here is how I would rewrite your code in a more organized way:

require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');

$info = array();

$urls = array(
    'http://www.youtube.com',
    'http://www.google.com.br'
);

foreach ($urls as $url)
{
    $str = file_get_contents($url);
    $html = str_get_html($str);

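    // find() with an index returns a single element (or null), not an array
    $titleNode = $html->find('title', 0);
    $title = $titleNode ? trim($titleNode->plaintext) : '';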

    $links = array();
    foreach ($html->find('a') as $anchor)
    {
        $links[] = url_to_absolute($url, strval($anchor->href));
    }
    $links = array_unique($links);

    $info[$url] = array(
        'title' => $title,
        'links' => $links
    );
}

print_r($info);
