Regex and PHP help

I’m trying to learn more about regex by scraping some sites and pulling just the URLs out of the HTML, but I’m having some issues with it. I have been able to pull the URLs out of this particular site, but I’m also grabbing tons of text along with them, so it’s not really working for me.

Please note that I’m not looking for links, just URLs.

My code looks like this:

$domain = 'google.com';
 
$page = 0;

$file = file_get_contents('http://www.alexa.com/site/linksin;' . $page . '/' . $domain);
 
preg_match_all('(<div class="site-listing">.*<a\\s* href="(.*?)\\/siteinfo\\/(.*?)" class="title">)siU', $file, $matches, PREG_SET_ORDER);
 


foreach($matches as $match)
{
	echo strip_tags($match[0]) . "\n<br />";
	//echo strip_tags($match[1]) . "\n<br />";
}

//echo count($match[0]);


I’ve tried all kinds of different tricks that I know of, but none are working for me. Hopefully someone can help me with this. I want to learn more about regex, so I’d like to know what I’m doing wrong.

Something that’s starting to confuse me is the array index for this… is the key 0 the first part of the regex where it says (.*?)? I thought it was, but with all of the issues I’ve run across today, it’s confusing me more and more. I do tend to overanalyze things, though.

Thanks in advance!

No, $match[0] is the complete string that was matched; $match[1] is the first atom (capture group), i.e., the first (.*?); $match[2] is the second atom, etc.
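To make the indexes concrete, here is a minimal sketch. The markup in $html is made up for illustration (the real page may look different), and the pattern is a simplified version of the one in the question, without the U modifier.

```php
<?php
// Hypothetical sample markup, invented for this example.
$html = '<a href="http://www.alexa.com/siteinfo/youtube.com" class="title">YouTube</a>';

// Two capture groups: one before /siteinfo/, one after it.
preg_match_all('~<a\s+href="(.*?)/siteinfo/(.*?)" class="title">~si', $html, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match[0], PHP_EOL; // index 0: everything the pattern matched, from <a to ">
    echo $match[1], PHP_EOL; // index 1: first group, the part before /siteinfo/
    echo $match[2], PHP_EOL; // index 2: second group, the part after /siteinfo/
}
```

With PREG_SET_ORDER every entry of $matches is one complete match, so echoing $match[0] always prints the whole matched string, not just a URL.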

BTW, (.*?) is not the best regex to use here (okay, it almost never is the best regex to use). For the first atom ([^/]+) would be better (match anything except for a forward slash), and for the second atom ([^"]+) would be better (match anything except for a double quote).
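A rough sketch of what that could look like, assuming markup along the lines of the sample below (the $file string is invented, not copied from the real page). Two things in it are my own adjustments: the first group uses * instead of + so that an href starting directly with /siteinfo/ still matches, and the U modifier is dropped. Under U the greediness of quantifiers is inverted, so (.*?) becomes greedy, which is probably part of why so much extra text was being matched.

```php
<?php
// Hypothetical sample markup, invented for this example; the real page may differ.
$file = '<div class="site-listing"><p>Some text</p>'
      . '<a href="/siteinfo/youtube.com" class="title">youtube.com</a></div>'
      . '<div class="site-listing"><p>More text</p>'
      . '<a href="/siteinfo/hi5.com" class="title">hi5.com</a></div>';

// ([^/"]*) may be empty and cannot cross a slash or the closing quote.
// ([^"]+) stops at the closing quote of the href attribute.
// No U modifier, so the quantifiers keep their normal greediness.
$pattern = '~<a\s+href="([^/"]*)/siteinfo/([^"]+)"\s+class="title">~i';

preg_match_all($pattern, $file, $matches, PREG_SET_ORDER);

$sites = [];
foreach ($matches as $match) {
    $sites[] = $match[2]; // the second group holds the site name
}

echo implode(PHP_EOL, $sites), PHP_EOL;
```

Because the pattern starts at the <a tag instead of the surrounding div, nothing between the div and the link ends up in the match, and printing $match[2] rather than $match[0] leaves only the site name.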

An alternative. :)


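<?php
// Load the page into a DOM tree instead of matching the raw HTML with a regex.
$page = new DOMDocument;
$page->loadHTMLFile('http://www.alexa.com/site/linksin/google.com');

$divs = $page->getElementsByTagName('div');

for($pos = 0; $pos < $divs->length; $pos++){

  $div = $divs->item($pos);

  // Only <div class="site-listing"> blocks hold the linking sites.
  $isSiteInfoDiv = $div->hasAttribute('class') && 'site-listing' === $div->getAttribute('class');

  if(false === $isSiteInfoDiv){
    continue;
  }

  // Skip a listing without a link rather than calling a method on null.
  $link = $div->getElementsByTagName('a')->item(0);

  if(null === $link){
    continue;
  }

  // The href has the form /siteinfo/<site>; sscanf extracts the part after the prefix.
  list($site) = sscanf($link->getAttribute('href'), '/siteinfo/%s');

  echo $site, PHP_EOL ;
}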

/*
  youtube.com
  hi5.com
  wretch.cc
  google.com.br
  free.21cn.com
  google.fr
  images.google.com
  metroflog.com
  orkut.com.br
  fotolog.net
  friendster.com
  google.es
  mail.google.com
  google.com.mx
  wikipedia.org
  google.co.uk
  baidu.com
  google.de
  flickr.com
  images.google.com.br
*/

Scallio, thanks a lot man! This worked exactly as I had been hoping and I learned something new!

I appreciate it :)

I actually wanted to learn about regex, but I do plan to learn more about the DOM, so I appreciate this! :D