Long time no blog – very busy searching Switzerland these days but popping in first to say congrats to the 2nd edition team and second, for the sheer evil delight in submitting a late PHP implementation for Tim Bray’s fascinating Wide Finder Project.
Tim set a simple, but very much real-world challenge; write an app that determines the top 10 most popular blogs from his Apache access log. It should be fast and readable, with a subtext of illustrating how “language X” copes in terms of parallel processing and utilizing “wider” (many processor) systems.
Now technically, PHP has almost zero ability to execute stuff in parallel, other than pcntl_fork – (don’t try that on your live webserver – CLI only!), unless you’re counting ticks, which I hope you aren’t.
But we’re not going to let little details stop us! Because we’ve got curl_multi_exec() which allows us to talk to multiple URLs and process their responses in parallel. And just think how many requests Apache + mod_php is able to serve at the same time. Perfect for some map and reduce…
So first a mapper;
<?php
if ( isset($argv[1]) ) $_GET['f'] = $argv[1];
if ( !isset($_GET['f']) || !is_readable($_GET['f']) ) {
header("HTTP/1.0 404 Not Found");
die("Cant read file ".htmlspecialchars($_GET['f']));
}
$s = isset($_GET['s']) && ctype_digit($_GET['s']) ? $_GET['s'] : 0;
$e = isset($_GET['s']) && ctype_digit($_GET['e']) ? $_GET['e'] : filesize($_GET['f']);
$f = fopen($_GET['f'], 'r');
fseek($f, $s);
$counts = array();
while ( !feof($f) && ftell($f) < $e ) {
$pattern = '#GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) #';
if ( preg_match($pattern, fgets($f), $m) ) {
isset($counts[$m[1]]) ? $counts[$m[1]]++ : $counts[$m[1]] = 1;
}
}
fclose($f);
$out = serialize($counts);
header("Content-Length: ".strlen($out));
echo $out;
You point it at a file it can reach via it’s filesystem plus give it a range of bytes ( a start and end ) to analyze. It returns a serialized hash, the keys being the blog entry titles and the values the number of occurences. A URL to call it might look like http://localhost/~harry/wf/wf_map.php?f=/tmp/o10k.ap&s=257664&e=524288 . Also for benchmarking it can be called from the command line to process a complete file in a single thread like $ php ./wf_map.php /tmp/o10k.ap.
Then we have the reducer, which actually invokes the mapper(s) using curl;
#!/usr/bin/php
<?php
function wf_reduce($urls) {
$counts = array();
$mh = curl_multi_init();
foreach ( $urls as $url ) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_multi_add_handle($mh, $ch);
}
do {
curl_multi_exec($mh,$running);
while ( FALSE !==($msg = curl_multi_info_read($mh)) ) {
if ( "200" != ( $status = curl_getinfo ( $msg['handle'], CURLINFO_HTTP_CODE ) ) ) {
die("HTTP Status: $status!");
}
// Here we can "reduce" as soon as we get a response...
foreach ( unserialize( curl_multi_getcontent ( $msg['handle'] ) ) as $page => $hits ) {
isset($counts[$page]) ? $counts[$page] += $hits : $counts[$page] = $hits;
}
}
} while ($running > 0);
return $counts;
}
$mapurl = "http://localhost/~harry/wf/wf_map.php";
$file = "/tmp/o10k.ap";
if (!file_exists($file) ) {
print "First run: downloading sample data!";
file_put_contents($file, file_get_contents("http://www.tbray.org/tmp/o10k.ap"));
die();
}
$partitions = isset($argv[1]) && ctype_digit($argv[1]) ? $argv[1] : 1;
$end = filesize($file);
$partition = floor($end / $partitions);
$s = 0;
$e = $partition;
$urls = array();
while ( $e <= $end ) {
$urls[] = sprintf("%s?f=%s&s=%s&e=%s", $mapurl, urlencode($file), $s, $e);
$s = $e;
$e += $partition;
}
$counts = wf_reduce($urls);
arsort($counts, SORT_NUMERIC);
$top10 = array_slice($counts, 0, 10, TRUE);
print_r( array_reverse($top10) );
It is meant purely for command line execution, taking one argument which is the number of parallel “threads” which should run e.g. $ ./wf_reducer.php 10 which will fire off 10 HTTP requests in parallel to the mapper, requesting it parse a different “chunk” of the file each time. You’ll need to change the $mapurl.
Doubt this will win any awards for beauty and benchmarks with Tim’s sample file are not impressive – it’s faster to run single threaded – perhaps you’d start gaining with a much bigger file? My desire to find out is diminishing.
But anyway – PHP was also there. Wonder if Tim will now compile it on that T5120 ? ;)
Update: Just discovered Russell Beattie got there first with a single threaded solution.
Related posts:
- Introducing php-tracer-weaver php-tracer-weaver is a tool for automatically generating docblock comments, with...
- How to Use PHP Namespaces, Part 3: Keywords and Autoloading In the final part of his series explaining PHP namespaces,...
- How to Install PHP on Windows In his final installation tutorial, Craig provides a step-by-step guide...
- Ready for PHP & MySQL Week at SitePoint? To celebrate the release of the new edition of well-loved...
- How to Use PHP Namespaces, Part 1: The Basics In the first part of a series of articles, Craig...







I just take a look on Russell Beattie solution. Quite simple and understandable. What bugs me is times that his solution obtains if it is compiled by Zend Guard or Nusphere NuCoder.
November 4th, 2007 at 9:11 am
Excellent info – Thank you!
November 6th, 2007 at 3:05 am
Death penalty for spammers NOW
November 6th, 2007 at 8:20 am
drosometer crestline hexachlorocyclohexane mellowy delapsion nonrecourse protype perorator
The Med Tech Connection
http://www.thepassionplay.com/
November 24th, 2007 at 6:24 pm