Web bugs for job scheduling: hack or solution?

One of the strange realities of PHP is that many projects build their design around the lowest common denominator – the shared web host – and for obvious reasons: web hosts place restrictions on the use of their accounts, e.g. no shell access, a read-only filesystem or limited ability to set filesystem permissions, weird PHP configurations and so on. One example springs to mind: I was recently taking another look at Drupal and ran into this snippet on caching;

With the cache turned on, Drupal stores all the HTML code for any page visited by an anonymous user directly into the database. When another request for the same page comes along, Drupal knows to fetch this page out of the database rather than re-generating it from scratch. The result is that hundreds of queries are replaced with one single query, thereby significantly lightening the load on the web server.

I assume it’s done this way because it’s easy for normal Drupal users to control on a shared host. That’s not to say it isn’t a valid solution, just that it would typically be faster to cache to a file.
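
For comparison, here’s a minimal sketch of what file-based page caching can look like (purely illustrative – the cache directory, key scheme and render_page() function are assumptions, not Drupal’s actual implementation);

<?php
// Illustrative file-based page cache (not Drupal's actual code).
// The cache directory, key scheme and render_page() are assumptions.
$cacheDir  = '/tmp/page_cache';
$cacheFile = $cacheDir . '/' . md5($_SERVER['REQUEST_URI']) . '.html';
$ttl       = 300; // cache lifetime in seconds

if (is_readable($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
    // Cache hit: send the stored HTML and skip page generation entirely
    readfile($cacheFile);
    exit;
}

// Cache miss: generate the page, send it and keep a copy for next time
if (!is_dir($cacheDir)) {
    @mkdir($cacheDir, 0777);
}
ob_start();
render_page(); // hypothetical function that builds the full HTML page
$html = ob_get_contents();
ob_end_flush();
@file_put_contents($cacheFile, $html);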

Anyway, perhaps one of the most notorious hacks of that nature is the web bug. While web bugs have a bad rap as a mechanism for tracking users without their realising it, their use doesn’t have to be for evil. In fact pseudocron shows exactly how far people are willing to go when they can’t get access to the shell.

You “include” pseudocron on your site by linking to it in an HTML image tag. It displays a 1×1 transparent image then proceeds to execute PHP scripts based on the pseudocron schedule file. The advantage of this approach is that the browser fires off a separate HTTP request, corresponding to a separate Apache process, so any code executed by pseudocron runs “out of band” from the main PHP script – visitors to your site aren’t subjected to a long delay if a “cron job” happens to be running. Well, almost…
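
For anyone who hasn’t seen it in action, the inclusion amounts to dropping an image tag into your page template, something like the following (the script path here is an assumption – adjust it to wherever pseudocron is installed);

<?php /* somewhere in the site's page template */ ?>
<!-- 1x1 "web bug" that triggers pseudocron on every page view -->
<img src="/pseudocron/pseudocron.php" width="1" height="1" alt="" />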

Recently Dokuwiki gained an indexer to improve the performance of its search functionality. Dokuwiki doesn’t use a database – wiki pages are stored directly in files and, until recently, each new search request did a “full scan” of the content (slow and resource intensive – some notes here). The new indexer solves this problem and is also “triggered” using a web bug: each time it is requested, it checks whether the current wiki page (the one containing the web bug) has changed and needs re-indexing.

On the first release containing the indexer, though, some users reported that “page loading” had become very slow. It turned out that, the way the web bug was working, the 1×1 image was only delivered after the indexing had completed. That meant that, although a browser had received the full wiki content, it was hanging around (with an open HTTP connection) waiting for the image to arrive. This resulted in a throbber that kept on throbbing, giving users the impression that the page was still loading. I was surprised to see pseudocron has a similar problem – while the image is displayed immediately, it gives the browser no indication that the image is complete, so it can leave the connection open.

To fix this in Dokuwiki, the 1×1 image was rendered immediately by the indexer script and the browser was sent a Content-Length header instructing it to drop the connection once it had the full image. Normally, when the HTTP connection is dropped by the client, the corresponding PHP script is killed, but this behaviour can be overruled with ignore_user_abort(), so the start of indexer.php became;


<?php
/**
 * DokuWiki indexer
 *
 * @license    GPL 2 (http://www.gnu.org/licenses/gpl.html)
 * @author     Andreas Gohr
 */
 
/**
 * Just send a 1x1 pixel blank gif to the browser and exit
 */
function sendGIF(){
    $img = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAEALAAAAAABAAEAAAIBTAA7');
    header('Content-Type: image/gif');
    header('Content-Length: '.strlen($img));
    header('Connection: Close');
    print $img;
    // Browser should drop connection after this
    // Thinks it's got the whole image
}
 
// Make sure image is sent to the browser immediately
ob_implicit_flush(TRUE);
 
// keep running after browser closes connection
@ignore_user_abort(true);
 
sendGIF();

// Browser is now gone...
 
// Switch off implicit flush again - we don't want to send any more output
ob_implicit_flush(FALSE);
 
// Catch any possible output (e.g. errors)
// - probably not needed but better safe...
ob_start();

// Start the real work now...

So far so good. But there’s another negative effect of using a web bug: for every page request a user makes, two Apache processes get tied up running PHP scripts. If the web bug script is doing long-running, resource-intensive work, a large number of visitors could result in a large number of web bugs running in parallel, locking up Apache child processes while eating CPU and memory left, right and center. As a side note, contrast that with what George is advising here;

Offloading static content. If the average page in your Web application contains nine images, then only ten percent of the requests to your Web server actually used the persistent connections they have assigned to them. In other words, ninety percent of the requests are wasting a valuable (and expensive, from a scalability standpoint) Oracle [persistent] connection handle. Your goal should be to ensure that only requests that require Oracle connectivity (or at least require dynamic content) are served off of your dynamic Web server. This will increase the amount of Oracle-related work done by each process, which in turn reduces the number of children required to generate dynamic content…The easiest way to promote this is by offloading all of your images onto a separate Web server (or set of Web servers).

In fact this wasn’t a problem for Dokuwiki, thanks to a side effect of Andi’s careful approach to indexing.

To avoid race conditions with multiple indexing processes running at the same time and trying to write to the same files, Andi applied the simple rule that only one indexer is allowed to run at any given time. To implement this he added a locking mechanism whereby the indexer creates a directory (always using the same directory name / path) as it starts running, then removes it when it’s finished. Creating a directory should be more efficient (and takes fewer lines of code) than creating a file, as it’s just a matter of updating the inode database. If the Dokuwiki indexer can’t obtain the lock (the lock directory already exists), it exits immediately, which frees up the Apache child process for more work.
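
In outline, that kind of lock looks something like this (a sketch only, not DokuWiki’s actual code – the lock path and run_indexer() are assumptions);

<?php
// Sketch of directory-based locking (not DokuWiki's actual code; the
// lock path and run_indexer() are assumptions). mkdir() either creates
// the directory or fails, so a single call acts as an atomic test-and-set.
$lockDir = '/tmp/indexer.lock';

if (!@mkdir($lockDir, 0777)) {
    // Another indexer already holds the lock: exit immediately,
    // freeing this Apache child process for other requests
    exit;
}

run_indexer(); // hypothetical function doing the actual indexing work

// Release the lock so a later request can index again
@rmdir($lockDir);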

In effect, the web bug used by Dokuwiki is probably as efficient and robust as it can be. While the general concept may be reminiscent of generating electricity from a hamster wheel, given an efficient solution it raises the question: if you have a host which does allow you the use of cron, would you still be tempted to live with Dokuwiki’s web bug? Still a dodgy hack or a valid solution?


  • http://blog.casey-sweat.us/ sweatje

    No salient comments on this blog post, but I just noticed the author of this post and the prior one :) Welcome back Harry, and I hope we continue to see you round these parts more often!

  • php_man

    Another welcome back, Harry. I was using the same sort of thing for my own purposes but hadn’t added ob_start(), which is a good idea. Thanks for the tip :)

  • shea

    No, why would you not use cron when you have the option to? Right tool for the job, yeah? However, if the question was – would you use this “dodgy hack” if cron wasn’t available? – my answer would be an (obviously) resounding yes. This solution is totally transparent to the user (and to the web server), so there is no reason not to. Although in the strictest definition of the term it is still a “hack”, I’m sure we can all drop the “dodgy” and utilise this valid solution.

  • http://www.phppatterns.com HarryF

    No, why would you not use cron when you have the option to? Right tool for the job, yeah? However, if the question was – would you use this “dodgy hack” if cron wasn’t available? – my answer would be an (obviously) resounding yes.

    Agreed for general task scheduling, but what about the argument that the web bug approach is better integrated with the application than cron can be? If you want something that’s triggered by events happening within your application (such as content being updated via a form), a web bug has a better chance of being able to respond directly to the event (although it can’t be relied on! – lynx users, or those still surfing with images disabled, for example, would be a problem).

    In other words, I think there’s a class of problems that can be better solved this way than with cron, such as the Dokuwiki indexer. It’s not a clear distinction, but the way Dokuwiki’s indexer works, popular wiki pages have a better chance of getting re-indexed when they change (in fact they’d often be re-indexed immediately after editing, if no other indexer is running). By indexing only one page at a time, the overhead is kept reasonably spread. The alternative with cron would likely be something that has to make complete sweeps of all pages and index those that have changed, and each time that job ran it could result in a serious resource hit. Implementing a smarter solution with cron, one which spreads the load, would probably turn out more complex than using a web bug.

    Should also have mentioned that PHP’s session garbage collector works on a similar basis – it’s incoming requests that fire the garbage collector. If you have no visitors, the session GC won’t run (so expired sessions will still be hanging around). See the configuration snippet at the end of this comment.

    Also, along the lines of George’s tip regarding images: given an environment you control, if you ran the web bug under a separate server, like thttpd or even nanoweb, on a subdomain, you’re no longer blocking Apache children.

    One other thought (dare I say it?) – this could also work well with AJAX, especially if you need to pass values to the web bug. That also sounds like a legitimate use of AJAX…
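
    On the session GC side note above, the behaviour is controlled by the standard session.gc_* ini directives – roughly, each session_start() has a gc_probability in gc_divisor chance of running the cleanup:

    <?php
    // The session garbage collector piggy-backs on incoming requests: on
    // each session_start() there's a gc_probability/gc_divisor chance that
    // expired session data gets cleaned up. No requests, no cleanup.
    ini_set('session.gc_probability', 1);
    ini_set('session.gc_divisor', 100);      // ~1% of requests run the GC
    ini_set('session.gc_maxlifetime', 1440); // idle seconds before a session counts as expired
    session_start();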

  • shea

    Ahh well the purpose of a cron job, which I’m sure you’re aware of, is to run a specified job at set intervals. So we are comparing apples to oranges if what you are after is a system that responds to events rather than set times. This web bug trick would indeed be best for the latter.

    However, I’m not sure I quite get why the indexer needs to be triggered by the web bug trick. Why wouldn’t the indexer be triggered when the process of saving the updates to the wiki occurs?

  • http://www.phppatterns.com HarryF

    So we are comparing apples to oranges if what you are after is a system that responds to events rather than set times.

    Agreed.

    Why wouldn’t the indexer be triggered when the process of saving the updates to the wiki occurs?

    Technically what you’re suggesting is doable – in the script that accepts the update, you could hang up the browser in the same way as above then start indexing.

    But I think the main thing here is whether updates then become the only way the indexes are refreshed. Andi has employed the simplest solution to avoid race conditions, with the rule that only one indexer may run at a time. But what if someone updates a page while the indexer is already running? Then you need some other mechanism to refresh the index later. And you also want to be able to reindex in case of corruption / data loss. I think the web bug approach makes the solution a lot simpler.

  • http://www.phpism.net Maarten Manders

    Great article, welcome back Harry!

  • dumky

    I’ve needed to have a background thread running in my web apps a number of times. Why not support this functionality in the web server?

  • http://www.phpism.net Maarten Manders

    Dumky, you can use cron jobs that execute PHP command line interface (CLI) scripts. It’s a common way to solve those problems.
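
    For instance (the interval, binary path and script name below are assumptions), a crontab entry pointing at a small CLI script:

    <?php
    // maintenance.php - a PHP CLI script run from cron, e.g. via a crontab
    // entry like:  */15 * * * * /usr/bin/php /path/to/maintenance.php
    // (the interval and paths are assumptions - adjust to your environment)

    rebuild_search_index(); // hypothetical function standing in for the real background job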

  • dumky

    Maarten, using PHP as a CLI scripting language for cron jobs does sound like it would help unify the development environment.

    But cron jobs run in a separate process. That means any result from the job won’t be directly available in memory for new requests being handled; you have to come up with some kind of inter-process communication solution, be it the filesystem or something else…
    Also, scheduled jobs create an additional deployment requirement.

    One specific scenario where I would have needed a good solution for background threads was generating and refreshing a cache of CAPTCHA images. The cache is in-memory to allow more throughput. Using a background thread to generate new images, without saving them to file, allows for a more integrated solution (fewer boundaries, no need for ACLing directories).

  • fryk

    Great article. But it doesn’t work for redirects. Does anyone know how to make a redirect and then continue script execution?

    The code below is not working – it only performs the redirect after 3 seconds:

    ob_implicit_flush(TRUE);
    @ignore_user_abort(true);

    header("HTTP/1.1 301 Moved Permanently");
    header('Content-Length: 0');
    header('Location: http://example.com');
    header('Connection: Close');

    ob_implicit_flush(FALSE);
    ob_start();

    sleep(3);

    // some code
