Web bugs for job scheduling: hack or solution?

One of the strange realities of PHP is that many projects build their design around the lowest common denominator – the shared web host – and for obvious reasons: web hosts place restrictions on the use of their accounts, e.g. no shell access, a read-only filesystem or limited ability to set filesystem permissions, weird PHP configurations etc. One example that springs to mind: I was recently taking another look at Drupal and ran into this snippet on caching;

With the cache turned on, Drupal stores all the HTML code for any page visited by an anonymous user directly into the database. When another request for the same page comes along, Drupal knows to fetch this page out of the database rather than re-generating it from scratch. The result is that hundreds of queries are replaced with one single query, thereby significantly lightening the load on the web server.

I assume it’s done this way because it’s easy for normal Drupal users to control on a shared host. That’s not to say it isn’t a valid solution, just that it would typically be faster to cache to a file.
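For comparison, here is a minimal sketch of what caching a rendered page to a file could look like. This is not Drupal code: the function name cachedPage(), the md5-of-URL file naming and the lifetime parameter are all assumptions made for illustration, and it presumes a cache directory the web server is allowed to write to (which is exactly what a shared host may not give you).

```php
<?php
// Hypothetical helper, not part of Drupal. The cache directory, the key
// scheme (md5 of the URL) and the lifetime are assumptions for illustration.

/**
 * Return the HTML for $url, serving it from a file if a fresh copy exists,
 * otherwise calling $render() and storing the result.
 */
function cachedPage(string $cacheDir, string $url, int $ttl, callable $render): string
{
    $file = $cacheDir . '/' . md5($url) . '.html';

    clearstatcache(); // make sure we see the file's current state
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file); // cache hit: no rendering needed
    }

    $html = $render(); // cache miss: build the page the expensive way

    // Write to a temporary file and rename it into place, so another
    // request never reads a half-written page.
    $tmp = $file . '.' . uniqid('', true) . '.tmp';
    file_put_contents($tmp, $html);
    rename($tmp, $file);

    return $html;
}
```

A caller would wrap its normal page generation in a closure and pass it as $render; only the first request within the lifetime pays for the rendering.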

Anyway, perhaps one of the most notorious hacks of that nature is the web bug. While web bugs have a bad rap as a mechanism for tracking users without their realising, their use doesn’t have to be for evil. In fact pseudocron shows exactly how far people are willing to go when they can’t get access to the shell.

You “include” pseudocron on your site by linking to it in an HTML image tag. It displays a 1×1 transparent image then proceeds to execute PHP scripts based on the pseudocron schedule file. The advantage of this approach is that the browser fires off a separate HTTP request, corresponding to a separate Apache process, so any code executed by pseudocron is run “out of band” from the main PHP script – visitors to your site aren’t subjected to a long delay if a “cron job” happens to be running. Well almost…

Recently Dokuwiki gained an indexer to improve the performance of its search functionality. Dokuwiki doesn’t use a database – wiki pages are stored directly in files and, until recently, each new search request did a “full scan” on the content (slow and resource intensive – some notes here). The new indexer solves this problem and is also “triggered” using a web bug. Each time it is requested, it checks to see if the current wiki page (containing the web bug) has changed and needs re-indexing.
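To make the “has this page changed?” check concrete, here is a rough sketch of one way it could be done with file modification times. This is an assumption about the general technique, not Dokuwiki’s actual implementation: the marker file and the function names needsReindex() and markIndexed() are made up for the example.

```php
<?php
// Hypothetical sketch, not Dokuwiki's code. The idea: keep a marker file per
// page and compare its modification time with that of the page itself.

/**
 * True if the page file is newer than its "indexed" marker,
 * or if it has never been indexed at all.
 */
function needsReindex(string $pageFile, string $markerFile): bool
{
    clearstatcache(); // PHP caches stat results; ask the filesystem again

    if (!is_file($pageFile)) {
        return false; // no page, nothing to index
    }
    if (!is_file($markerFile)) {
        return true; // page exists but was never indexed
    }
    return filemtime($pageFile) > filemtime($markerFile);
}

/** Record that the page has just been indexed. */
function markIndexed(string $markerFile): void
{
    touch($markerFile); // sets the marker's modification time to now
}
```

The web bug script would call needsReindex() first and return straight away when it says no, so most requests cost next to nothing.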

On the first release containing the indexer though, some users reported that “page loading” had become very slow. It turned out that, the way the web bug was working, the 1×1 image was only getting displayed after the indexing had completed. That meant that, although a browser had received the full wiki content, it was hanging around (with an open HTTP connection) waiting for the image to arrive. This resulted in a throbber that kept on throbbing, giving users the impression that the page was still loading. I was surprised to see pseudocron has similar problems – while the image is displayed immediately, it gives no indication to the browser that the image has finished, so the connection could be left open.

To fix this in Dokuwiki, the 1×1 image is now rendered immediately in the indexer script, and the browser is sent a Content-Length header that tells it to drop the connection once it has the full image. Normally, when the HTTP connection is dropped by the client, the corresponding PHP script will be killed. But this behaviour can be overridden with ignore_user_abort(), so the start of indexer.php became;


<?php
/**
 * DokuWiki indexer
 *
 * @license    GPL 2 (http://www.gnu.org/licenses/gpl.html)
 * @author     Andreas Gohr
 */
 
/**
 * Just send a 1x1 pixel blank gif to the browser and exit
 */
function sendGIF(){
    $img = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAEALAAAAAABAAEAAAIBTAA7');
    header('Content-Type: image/gif');
    header('Content-Length: '.strlen($img));
    header('Connection: Close');
    print $img;
    // Browser should drop connection after this
    // Thinks it's got the whole image
}
 
// Make sure image is sent to the browser immediately
ob_implicit_flush(TRUE);
 
// keep running after browser closes connection
@ignore_user_abort(true);
 
sendGIF();

// Browser is now gone...
 
// Switch off implicit flush again - we don't want to send any more output
ob_implicit_flush(FALSE);
 
// Catch any possible output (e.g. errors)
// - probably not needed but better safe...
ob_start();

// Start the real work now...

So far so good. But there’s another negative effect of using a web bug. For every page request a user makes, two Apache processes are getting tied up with running PHP scripts. If the web bug script does long-running, resource-intensive work, a large number of visitors could result in a large number of web bugs running in parallel, locking up Apache child processes while eating CPU and memory left, right and center. As a side note, contrast that with what George is advising here;

Offloading static content. If the average page in your Web application contains nine images, then only ten percent of the requests to your Web server actually used the persistent connections they have assigned to them. In other words, ninety percent of the requests are wasting a valuable (and expensive, from a scalability standpoint) Oracle [persistent] connection handle. Your goal should be to ensure that only requests that require Oracle connectivity (or at least require dynamic content) are served off of your dynamic Web server. This will increase the amount of Oracle-related work done by each process, which in turn reduces the number of children required to generate dynamic content…The easiest way to promote this is by offloading all of your images onto a separate Web server (or set of Web servers).

In fact this wasn’t a problem for Dokuwiki, thanks to a side effect of Andi’s careful approach to indexing.

To avoid race conditions with multiple indexing processes running at the same time and trying to write to the same files, Andi applied the simple rule that only one indexer is allowed to run at any given time. To implement this he added a locking mechanism whereby the indexer creates a directory (always using the same directory name / path) as it starts running, then removes it when it’s finished. Creating a directory should be more efficient (and takes fewer lines of code) than creating a file, as it’s just a matter of updating the inode database, and mkdir() either creates the directory or fails because it already exists, so checking for the lock and taking it happen in a single step. If the Dokuwiki indexer can’t obtain a lock (the lock directory exists) it exits immediately, which frees up the Apache child process for more work.
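A minimal sketch of that kind of directory lock, assuming nothing about Dokuwiki’s real code beyond what’s described above. The function names acquireLock(), releaseLock() and runExclusively() are hypothetical, as is the lock path you would pass in.

```php
<?php
// Hypothetical helpers illustrating a directory-based lock; these are not
// the names or the code Dokuwiki uses.

/**
 * Try to take the lock by creating the lock directory.
 * mkdir() fails if the directory already exists, so only one caller wins.
 */
function acquireLock(string $lockDir): bool
{
    // @ hides the warning PHP raises when the directory is already there
    return @mkdir($lockDir, 0755);
}

/** Give the lock back by removing the directory. */
function releaseLock(string $lockDir): void
{
    @rmdir($lockDir);
}

/**
 * Run $job only if nobody else holds the lock.
 * Returns true if the job ran, false if another run was already in progress.
 */
function runExclusively(string $lockDir, callable $job): bool
{
    if (!acquireLock($lockDir)) {
        return false; // someone else is working; exit immediately
    }
    try {
        $job();
    } finally {
        releaseLock($lockDir); // clean up even if the job throws
    }
    return true;
}
```

One thing a sketch like this doesn’t handle is a stale lock: if the script dies from a fatal error or gets killed before it reaches the cleanup, the directory stays behind and blocks every later run until something removes it, so a real implementation would probably want to expire locks that are older than some limit.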

In the end, the web bug used by Dokuwiki is probably as efficient and robust as it can be. While the general concept may be reminiscent of generating electricity from a hamster wheel, given an efficient solution, it raises the question: if you have a host which does allow you to use cron, would you still be tempted to live with Dokuwiki’s web bug? Still a dodgy hack or a valid solution?