Wide Finder in … errr … PHP

Long time no blog – very busy searching Switzerland these days – but popping in, first to say congrats to the 2nd edition team, and second for the sheer evil delight of submitting a late PHP implementation for Tim Bray’s fascinating Wide Finder Project.

Tim set a simple, but very much real-world, challenge: write an app that determines the top 10 most popular blog entries from his Apache access log. It should be fast and readable, with the subtext of illustrating how “language X” copes with parallel processing and utilizing “wider” (many-processor) systems.

Now technically, PHP has almost zero ability to execute stuff in parallel, other than pcntl_fork() (don’t try that on your live webserver – CLI only!), unless you’re counting ticks, which I hope you aren’t.
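For the record, here’s a minimal sketch of what pcntl_fork() looks like from the CLI (assumes the pcntl extension is loaded; it plays no part in the solution below):

<?php
// Minimal pcntl_fork() illustration - CLI only!
$pid = pcntl_fork();
if ( -1 == $pid ) {
	die("fork failed\n");
} elseif ( 0 == $pid ) {
	// Child process: do some work, then exit
	echo "child ".getmypid()." reporting\n";
	exit(0);
}
// Parent process: wait for the child and collect its exit status
pcntl_waitpid($pid, $status);
echo "parent: child exited with ".pcntl_wexitstatus($status)."\n";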

But we’re not going to let little details stop us! Because we’ve got curl_multi_exec(), which allows us to talk to multiple URLs and process their responses in parallel. And just think how many requests Apache + mod_php is able to serve at the same time. Perfect for some map and reduce.

So first, a mapper:


<?php
// Mapper: counts article hits within a byte range of an Apache access log
if ( isset($argv[1]) ) $_GET['f'] = $argv[1]; // allow CLI usage: php wf_map.php <file>

if ( !isset($_GET['f']) || !is_readable($_GET['f']) ) {
	header("HTTP/1.0 404 Not Found");
	die("Can't read file ".htmlspecialchars(isset($_GET['f']) ? $_GET['f'] : ''));
}

// Byte range to scan; defaults to the whole file
$s = isset($_GET['s']) && ctype_digit($_GET['s']) ? $_GET['s'] : 0;
$e = isset($_GET['e']) && ctype_digit($_GET['e']) ? $_GET['e'] : filesize($_GET['f']);

$f = fopen($_GET['f'], 'r');
fseek($f, $s);

// Tim's pattern: GET requests for /ongoing/When/<ddd>x/<yyyy/mm/dd/title>
$pattern = '#GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) #';

$counts = array();
while ( !feof($f) && ftell($f) < $e ) {
	if ( preg_match($pattern, fgets($f), $m) ) {
		isset($counts[$m[1]]) ? $counts[$m[1]]++ : $counts[$m[1]] = 1;
	}
}
fclose($f);

// Ship the per-chunk counts back as a serialized PHP array
$out = serialize($counts);
header("Content-Length: ".strlen($out));
echo $out;

You point it at a file it can reach via its filesystem, plus give it a range of bytes (a start and an end) to analyze. It returns a serialized hash, the keys being the blog entry titles and the values the number of occurrences. A URL to call it might look like http://localhost/~harry/wf/wf_map.php?f=/tmp/o10k.ap&s=257664&e=524288 . For benchmarking, it can also be called from the command line to process a complete file in a single thread, like $ php ./wf_map.php /tmp/o10k.ap.
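For a quick sanity check of a single mapper response, something like this works (the URL and byte range here are just examples – adjust to your setup):

<?php
// Hypothetical sanity check: fetch one chunk from the mapper and inspect the counts
$url = "http://localhost/~harry/wf/wf_map.php?f=".urlencode("/tmp/o10k.ap")."&s=0&e=262144";
$counts = unserialize(file_get_contents($url));
arsort($counts, SORT_NUMERIC);
print_r(array_slice($counts, 0, 3, TRUE)); // the three most-hit entries in this chunk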

Then we have the reducer, which actually invokes the mapper(s) using curl:


#!/usr/bin/php
<?php
function wf_reduce($urls) {

	$counts = array();
	
	$mh = curl_multi_init();
	
	foreach ( $urls as $url ) {
		$ch = curl_init($url);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
		curl_multi_add_handle($mh, $ch);
	}

	do {
		curl_multi_exec($mh, $running);
		if ( $running ) {
			curl_multi_select($mh); // block until a transfer has activity rather than busy-waiting
		}

		while ( FALSE !== ($msg = curl_multi_info_read($mh)) ) {

			if ( 200 != ( $status = curl_getinfo( $msg['handle'], CURLINFO_HTTP_CODE ) ) ) {
				die("HTTP Status: $status!");
			}

			// Here we can "reduce" as soon as we get a response...
			foreach ( unserialize( curl_multi_getcontent( $msg['handle'] ) ) as $page => $hits ) {
				isset($counts[$page]) ? $counts[$page] += $hits : $counts[$page] = $hits;
			}

			curl_multi_remove_handle($mh, $msg['handle']);
		}
	} while ($running > 0);

	curl_multi_close($mh);

	return $counts;

}

$mapurl = "http://localhost/~harry/wf/wf_map.php";
$file = "/tmp/o10k.ap";

if ( !file_exists($file) ) {
	print "First run: downloading sample data - run again to analyze!\n";
	file_put_contents($file, file_get_contents("http://www.tbray.org/tmp/o10k.ap"));
	die();
}

// Number of parallel "threads", i.e. simultaneous mapper requests
$partitions = isset($argv[1]) && ctype_digit($argv[1]) && $argv[1] > 0 ? (int) $argv[1] : 1;

$end = filesize($file);
$partition = ceil($end / $partitions); // ceil, so the chunks cover the whole file

$s = 0;
$urls = array();

// Build one mapper URL per byte range
while ( $s < $end ) {
	$e = min($s + $partition, $end); // clamp the final chunk to the end of the file
	$urls[] = sprintf("%s?f=%s&s=%s&e=%s", $mapurl, urlencode($file), $s, $e);
	$s = $e;
}

$counts = wf_reduce($urls);
arsort($counts, SORT_NUMERIC);

$top10 = array_slice($counts, 0, 10, TRUE);
print_r( array_reverse($top10) );

It is meant purely for command line execution, taking one argument: the number of parallel “threads” to run. So $ ./wf_reducer.php 10 fires off 10 HTTP requests in parallel to the mapper, each requesting it parse a different “chunk” of the file. You’ll need to change $mapurl to match your setup.

I doubt this will win any awards for beauty, and benchmarks with Tim’s sample file are not impressive – it’s actually faster to run single-threaded. Perhaps you’d start gaining with a much bigger file, but my desire to find out is diminishing.
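If you do feel like trying, here’s a sketch for inflating the sample (assuming it lives at /tmp/o10k.ap) so you can time the reducer against a single-threaded run:

<?php
// Sketch: repeat the 10k-line sample 100x, then compare e.g.
//   time php ./wf_map.php /tmp/o1000k.ap   (single thread)
// against
//   time ./wf_reducer.php 10               (after pointing $file at the bigger log)
$sample = file_get_contents("/tmp/o10k.ap");
$big = fopen("/tmp/o1000k.ap", "w");
for ( $i = 0; $i < 100; $i++ ) {
	fwrite($big, $sample);
}
fclose($big);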

But anyway – PHP was also there. Wonder if Tim will now compile it on that T5120? ;)

Update: Just discovered Russell Beattie got there first with a single-threaded solution.


  • krdr

    I just took a look at Russell Beattie’s solution. Quite simple and understandable. What bugs me is what times his solution would get if it were compiled by Zend Guard or NuSphere NuCoder.
