The sysadmin view on "Why PHP"

A funny from the Python crowd: phpfilter – PHP “support” under CherryPy. There is a serious side to that though: it’s spitting out something that looks like a PHP parse error, i.e. a developer problem (e.g. someone ftp’d a PHP script straight onto their live web server for “testing”), not a runtime error.

More to the point, when was the last time you saw a PHP runtime error take down an entire application or web server? And no – “MySQL Connection Failed: Can’t connect to local MySQL server” doesn’t count – PHP and the web server are still running – the MySQL server (or otherwise) is to blame.

With PHP it’s very hard for a script to take down the runtime environment – the web server – I’d argue that you’d have to be deliberately trying to do so, perhaps filling up disk space or otherwise. Innocent mistakes, specific instances of runtime problems (e.g. script execution too long) and bugs remain local to specific requests and the PHP script handling them. On the next request, we begin again from scratch.

It may now be reasonable to claim that Apache + mod_php has served more HTTP requests for dynamic pages than any other comparable environment. Warts and all, this is tested software simply by weight of numbers. That translates into a platform which costs little to keep running, with less chance of a wake-up call at 2am.

Anyway – I ran into an excellent blog post recently: FastCGI, SCGI, and Apache: Background and Future, which discusses the options given new demand for FastCGI with frameworks like Rails, seen through the eyes of a sysadmin. To a great extent it also explains why we’ve ended up with PHP.

To really grasp the argument Mark Mayo is making, it’s worth having a rough idea of the most common technical approaches used to implement servers that can handle multiple web page requests concurrently and pass each request through to a program (e.g. a PHP script) for processing. Note this isn’t meant to be an in-depth guide to multitasking – it’s more my dummy understanding / view. A good place to start if you want something more meaty is on Wikipedia here.

Forking: an HTTP server process spawns a child process to handle each incoming request, the children either expiring (exit) or returning to a “pool” for reuse when the request is finished (Apache 1.3.x does the latter).

With Apache + CGI scripts, the Apache child processes must, in turn, fork further child processes within which the CGI program runs, so it gets pretty slow. FastCGI eliminates that by keeping the CGI process running for further requests (but needs a bunch more complexity to do so).

With mod_php, the script is run inside the Apache child process itself. This reduces the overhead of a further fork and means the PHP “runtime” only needs to be loaded when an Apache child is created.

Forking is nice in that it’s relatively easy to implement and (for the most part) multitasking issues are not pushed onto application developers.

Another thing that makes this model popular with sysadmins is that child processes can “crash” (e.g. that infinite loop in your PHP script) without taking out the main server process. This is probably the number 1 reason why shared hosts are willing to install mod_php: they don’t have to keep restarting the server as a result of what their customers did to it.
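To make that crash isolation concrete, here is a minimal sketch in Python (not PHP, and not how Apache is actually implemented) of a parent process that forks one child per request. The `handle_request` function, the request paths and the exit codes are all made up for illustration, and `os.fork` means it assumes a UNIX-like system.

```python
import os


def handle_request(path: str) -> None:
    """Hypothetical user script: one path is deliberately buggy."""
    if path == "/buggy":
        raise RuntimeError("bug in user script")
    # A real handler would write a response to the client here.


def serve(paths):
    """Fork one child per request; return each child's exit code."""
    exit_codes = {}
    for path in paths:
        pid = os.fork()
        if pid == 0:
            # Child process: run the user script, then exit without ever
            # returning into the parent's loop.
            code = 1  # assume failure unless the handler completes
            try:
                handle_request(path)
                code = 0
            finally:
                os._exit(code)
        # Parent process: wait for the child and record how it ended.
        _, status = os.waitpid(pid, 0)
        exit_codes[path] = os.WEXITSTATUS(status)
    return exit_codes


if __name__ == "__main__":
    print(serve(["/index", "/buggy", "/about"]))
```

The buggy request only costs its own child a non-zero exit code; the parent carries on and serves the next request as if nothing had happened.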

Also, particular to CGI, it’s easier to push security issues off to the operating system, allowing user scripts to be run with their permissions rather than the permissions of the web server user.

This is not the case with mod_php, which sidesteps normal UNIX filesystem security: scripts run with the permissions of the web server user, and only have to be readable (not executable) on the filesystem for mod_php to execute them.

The downside of forking is it’s (relatively) slow / expensive to fork a new process and each child gobbles up memory and resources while it’s running, where it might be more efficient to share. The mod_php approach is the simplest way to keep this cost to a minimum.

Also Windows doesn’t really support UNIX-style forking, placing greater emphasis on threading, which may be a problem if you want your server to run well under Windows.

Threads: threads run inside a single process and work on the basis of time-sharing: each thread gets a certain amount of time to do stuff. Threads are now an option in Apache 2.x (the worker MPM) and are also common in Java application servers (which are themselves HTTP servers).

Threads have the advantage of being cheaper to “create” (i.e. faster) than forked processes, and it’s easier to “share” between threads (e.g. sharing a variable). Side note: when the Java guys say they’ve got a web server which performs better than PHP, they’re probably telling the truth (but remember performance != scaling).

On the downside, some argue that threads are very tricky to code, with hard-to-debug problems like deadlocks and race conditions being too easy to create. This may only be an issue for the developers of the web server – you don’t need to push threads onto people writing apps to run under your web server – but the more complexity, the more bugs etc.
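As a small illustration of the care threaded code needs, here is a sketch in Python where several threads bump one shared hit counter. `HitCounter` and `simulate` are made-up names for this example. The lock is what keeps the read-modify-write on `hits` from racing; without it, updates could be lost under the wrong interleaving.

```python
import threading


class HitCounter:
    """Shared state that several request threads update."""

    def __init__(self):
        self._lock = threading.Lock()
        self.hits = 0

    def record(self):
        # The lock makes the increment atomic with respect to other threads.
        with self._lock:
            self.hits += 1


def simulate(num_threads=8, requests_per_thread=10_000):
    """Run worker threads against one counter and return the final total."""
    counter = HitCounter()

    def worker():
        for _ in range(requests_per_thread):
            counter.record()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.hits


if __name__ == "__main__":
    print(simulate())
```

Sharing the counter is trivial compared to doing the same across forked processes, which is the upside; remembering the lock every single time is the downside.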

Also (more of an implementation detail), if each thread in the server is given its own I/O stream for an incoming request, this is likely to gobble memory / resources. On top of that, most operating systems only support a limited number of threads running concurrently – for a serious discussion see The C10K problem (an excellent read in general, in fact).

The other issue with threads and web servers is there’s a better chance of a given thread taking down the whole server, although that’s probably more of an implementation detail.

Asynchronous I/O: it’s common in programming to use synchronous (blocking) I/O – you read from a “stream” and your code (process) stops execution until the read is complete.

Asynchronous I/O uses non-blocking system calls to allow your code (process) to continue doing other things (e.g. more I/O) in the meantime. Callbacks (or similar) are then only executed when a specific event happens (e.g. end of file). And these days we’re all familiar with this way of doing things thanks to AJAX, right? ;)
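Here is a rough sketch of that callback style using Python’s standard `selectors` module, with local socket pairs standing in for client connections. The function and connection names are invented for this example, and it isn’t meant to reflect how any particular server or framework is built. The point is just that a single process registers interest in several streams and only runs a callback once one of them actually has data.

```python
import selectors
import socket


def run_event_loop(messages):
    """Register a read callback per connection, then dispatch on readiness."""
    selector = selectors.DefaultSelector()
    received = []  # (connection name, payload) in the order callbacks fired

    def on_readable(conn, name):
        # Only called once the selector reports data, so recv has bytes ready.
        received.append((name, conn.recv(1024)))
        selector.unregister(conn)
        conn.close()

    writers = []
    for name in messages:
        reader, writer = socket.socketpair()
        reader.setblocking(False)
        selector.register(reader, selectors.EVENT_READ, (on_readable, name))
        writers.append((name, writer))

    # Simulate clients sending in the reverse of the registration order.
    for name, writer in reversed(writers):
        writer.sendall(messages[name])
        writer.close()
        # Run callbacks for whichever connections are ready right now.
        for key, _events in selector.select(timeout=1):
            callback, conn_name = key.data
            callback(key.fileobj, conn_name)

    selector.close()
    return received


if __name__ == "__main__":
    print(run_event_loop({"a": b"first", "b": b"second"}))
```

Connection "b" is handled before "a" even though it was registered second: the order of work follows the order in which data shows up, not the order in which the connections were opened.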

Perhaps the foremost example of async I/O is Python’s twisted framework, which I’d guess we’ll hear more and more of in the next couple of years.

Async I/O is nice in that it does not have the limits threading does and probably results in more efficient use of resources. It may (depending on your API – at lower levels, it’s harder) also be easier to write code this way although it’s still not as easy as forking – much of what twisted does is about providing a nice API for async I/O, solving most concurrency issues for you so you can focus on higher level problems.

I guess you also have the risk that “user code” takes down the whole server with Async I/O – haven’t looked at how twisted deals with this – perhaps this is just implementation detail.

BTW you may also be surprised to note that more recent PHP versions also have some support for async I/O. See here (PDF) for more info.

Of course it’s definitely not as clear cut as I’m suggesting. For starters, what type of developer you are will influence your world view: Linux kernel developers would see different problems and boundaries from language and library designers, who in turn see things in a different light from application developers consuming the available APIs. And a given web server would likely use more than one approach – perhaps all three.

What does seem to be the case is that async I/O is only now coming of age / gaining popularity in web servers. Meanwhile FastCGI is back in demand and development, given Rails, web.py and similar. Despite that, mod_php still (today) represents the least of all evils for sysadmins – not the perfect solution (e.g. security headaches) but the best compromise all round – at least for the foreseeable future.

BTW: if you’re feeling like another angle on PHP’s past, see Adam Trachtenberg’s The battle for middleware: PHP versus the world (PDF).
