After various comments to the effect of “well clearly Friendster don’t know Java”, I think this put an end to that line:
“We had not one but TWO guys here who had written bestselling JSP books. Not that this necessarily means they’re great Java devs, but I actually think our guys were as good as any team.”
Without wishing to add more fuel to the fire, it seems that there are some J2EE developers out there who just don’t “get” PHP. Which is interesting in itself, as what’s so hard to grasp about PHP? At some fundamental level there seems to be a difference in mind set between the J2EE guy and the PHP guy that means (at least in one direction) one can’t grasp that the other approach also works (and may actually work better).
CGI by Dummy
Perhaps the easiest way to explain what PHP does on a web server (as an Apache module) is to compare it with CGI (everyone understands CGI, right?).
Side Note / Disclaimer: I’m not the best qualified to talk about PHP request lifecycles, performance and scalability. PHP really needs Sterling, George and Rasmus to get together and write a detailed paper on how it works and why PHP scales so we can all live happily ever after.
My take on the CGI request lifecycle is that it looks like this:
1. Apache receives request for page and sees it needs processing by CGI
2. Apache forks a process to handle the request (incurring overhead)
3. Whatever CGI binary performs its startup (more overhead)
4. The CGI binary processes the request and delivers a response to the browser
5. Process / CGI binary “dies” – return to step 1
With PHP (as an Apache module) it’s almost exactly the same, except that PHP runs in-process, meaning there’s no overhead for forking an external process and much less work that needs to take place in terms of PHP “startup”. In other words we’re talking only steps 1 and 4 above. I attempted to explain this further here.
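A rough sketch of the difference between the two lifecycles, in Python rather than PHP and not a model of how Apache actually works. The names `handle`, `serve_cgi_style` and `serve_in_process` are made up for illustration: the first server function starts a fresh interpreter for every request (steps 2 to 5), the second just calls code that is already loaded (steps 1 and 4).

```python
import subprocess
import sys


def handle(path):
    """The application logic, identical in both models."""
    return "Hello from " + path


def serve_cgi_style(path):
    # CGI model: start a new interpreter process for this one request,
    # let it do its startup, produce the response, then exit.
    script = "import sys; print('Hello from ' + sys.argv[1], end='')"
    result = subprocess.run(
        [sys.executable, "-c", script, path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def serve_in_process(path):
    # Module model: the interpreter is already loaded in the server
    # process, so handling a request is just a function call.
    return handle(path)


cgi_response = serve_cgi_style("/index")
module_response = serve_in_process("/index")
```

Both calls produce the same response; the only difference is how much work happens around the request, which is the overhead described in steps 2, 3 and 5.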
By contrast, an application running as a Java servlet is memory resident. Where a PHP developer stores session information in a file or a database, a Java developer may put it in memory. That’s an important point for understanding the difference between the Java and PHP paths.
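To make that point concrete, here is a minimal sketch in Python, not anyone’s real session handler. `FileSessionStore` and `MemorySessionStore` are hypothetical names, and two separate store objects stand in for two separate worker processes (or machines).

```python
import json
import os
import tempfile


class FileSessionStore:
    """Session data lives on disk, so any worker that can see the
    directory can read it."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, session_id):
        return os.path.join(self.directory, "sess_" + session_id + ".json")

    def save(self, session_id, data):
        with open(self._path(session_id), "w") as f:
            json.dump(data, f)

    def load(self, session_id):
        try:
            with open(self._path(session_id)) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}


class MemorySessionStore:
    """Session data lives inside this one process only."""

    def __init__(self):
        self._sessions = {}

    def save(self, session_id, data):
        self._sessions[session_id] = data

    def load(self, session_id):
        return self._sessions.get(session_id, {})


session_dir = tempfile.mkdtemp()

# "Worker one" saves the session, "worker two" loads it.
FileSessionStore(session_dir).save("abc", {"user": "alice"})
file_seen_by_other_worker = FileSessionStore(session_dir).load("abc")

MemorySessionStore().save("abc", {"user": "alice"})
memory_seen_by_other_worker = MemorySessionStore().load("abc")
```

The second worker finds the file-backed session but sees an empty in-memory one, which is why sharing memory-resident state across machines needs some form of replication.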
I found a fairly useful paper from 2000, “Performance Comparison Of Alternative Solutions For Web-To-Database Applications”, which discusses this in some more detail. It comes out in favour of servlets when considering performance, and with key phrases like “Since servlets are written in the highly portable Java language and follow a standard framework, they provide a means to create sophisticated server extensions in a server and operating system independent way.” be warned that here’s someone who’s read the label.
But here’s the big point:
Scalability != Performance (!!!)
Unfortunately the author of The PHP Scalability Myth got that wrong and I’ve been seeing comments to the same effect on Friendster.
Yes, the subjects are related, but scalability is more about what happens when you add more resources and how that increases the volume of requests your application can handle. See Wikipedia on Scalability. Typically (if you didn’t plan in advance) you start thinking about scaling when performance starts to drop off due to increased load.
That a Java servlet performs better than a PHP script under optimal conditions (e.g. plenty of free memory) has nothing to do with scalability. The point is whether your application can continue to deliver consistent performance as volume increases – can you maintain performance by adding hardware, for example?
In other words, “This page takes 0.5 seconds to complete its response. Can we preserve that performance with another 500,000 hits a day?” (scalability) is what we’re interested in, not “This page takes 0.5 seconds. How can we reduce that to 0.1?” (performance).
Generally people talk about two types of scalability – vertical (adding new processors, disk, memory to your existing “big box” or buying a “bigger box”) and horizontal (adding extra “boxes” and distributing load between them).
Vertical scalability is easy to implement but generally more expensive long term. There’s typically a limit to how much memory, for example, you can add to your existing “box” and the “next box up” costs three times the price but only increases capacity by 20%.
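A quick back-of-the-envelope version of that trade-off, in Python. The “three times the price, 20% more capacity” figures are the illustrative ones from the paragraph above; the horizontal case assumes ideal linear scaling across three identical boxes, which real systems rarely achieve.

```python
# One existing box, in arbitrary units of cost and capacity.
base_cost = 1.0
base_capacity = 1.0

# Vertical: the "next box up" costs 3x and adds 20% capacity.
vertical_cost = 3.0 * base_cost
vertical_capacity = 1.2 * base_capacity

# Horizontal: spend the same money on three ordinary boxes.
# Assumption: load is spread perfectly, so capacity adds up linearly.
horizontal_cost = 3 * base_cost
horizontal_capacity = 3 * base_capacity

# Cost per unit of capacity under each approach.
vertical_cost_per_unit = vertical_cost / vertical_capacity
horizontal_cost_per_unit = horizontal_cost / horizontal_capacity
```

Under these assumptions the bigger box costs roughly 2.5 per unit of capacity against 1.0 for the three small boxes. The effort and cunning needed to actually spread the load is not counted here.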
Horizontal scalability takes more effort / cunning but can prove extremely successful, as mused in The Secret Source of Google’s Power – build a “super computer” out of dirt cheap parts. For filesystem (and by extension database) replication across multiple systems there’s generally a range of mature solutions out there to choose from, many Open Source. Memory replication is another story (now go back to that important point up there).
Who do you trust?
One of the comments here pointed out that “J2EE can run in a cluster”, taking care of memory replication for example.
Right here is where you need to ask “what is Java?”. Is it just a programming language? Or is the runtime + libraries an Operating System?
Reading “Twelve rules for developing more secure Java code”, you find tips like “Make your classes nonserializeable: Serialization is dangerous because it allows adversaries to get their hands on the internal state of your objects.”.
This security tip simply does not compute in PHP – it’s all or nothing. Those I expose my objects to I implicitly trust (ignoring RPC for a moment). Sure, there are some PHP frameworks out there, but you won’t find hosts providing them as a service.
Therein lies my own fundamental problem with the “Java way” and app servers like JBoss. They seem to reinvent a whole bunch of wheels (particularly where replication is concerned) in what is already well charted territory. And J2EE is only about four years old.
As Rasmus keeps saying when talking about “shared nothing”, PHP delegates all the “hard stuff” to other systems. Apache (which I think it’s safe to say can be trusted) takes care of handling requests, forking children when needed. Tools like Squid are recommended by Rasmus for balancing the load.
When it comes to session data I’m much more willing to believe in filesystem or database clustering than J2EE clustering, the key points being maturity both in the mechanism and the tools that support it (for sysadmins). There’s also some serious weight going into Linux clustering which presents another path for PHP while other alternatives like MSession or memcached (pecl::memcache) also exist.
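A toy illustration of “shared nothing”, in Python and under heavy assumptions: a plain dict stands in for the shared store (filesystem, database or memcached), `make_worker` is a made-up helper, and the “load balancer” is naive round robin rather than anything Squid does.

```python
import itertools

# Stands in for the external store all web servers can reach.
shared_store = {}


def make_worker(name):
    def worker(session_id):
        # Load state, do the work, write state back.
        # The worker itself keeps nothing between requests.
        session = shared_store.get(session_id, {"hits": 0})
        session = {"hits": session["hits"] + 1, "last_worker": name}
        shared_store[session_id] = session
        return session

    return worker


workers = [make_worker("web" + str(i)) for i in range(1, 4)]
balancer = itertools.cycle(workers)  # naive round robin

# Five requests for the same session land on different workers.
responses = [next(balancer)("abc") for _ in range(5)]
```

The hit count comes out right even though no two consecutive requests are served by the same worker, because all the state lives in the store. Adding a fourth worker changes nothing in the application code; whether the store itself keeps up is the question the rest of this post is about.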
An amusing read is “Why Java Sucks For Sysadmins”, with some very valid points like:
java.io.FileNotFoundException: somefile (No such file or directory)
at java.io.FileInputStream.open(Native Method)
Warning: file(somefile): failed to open stream: No such file or directory in /home/hfuecks/scripts/reader.php on line 10
$ man cron
“4th Berkeley Distribution 20 December 1993”
Mind set Discontinuity
In responding to Rasmus’s comment on shared nothing / infinite horizontal scalability, someone called Mark wrote:
“Rasmus, your post is the very reason _not_ to use PHP. You’re pushing session state, inter-process messaging, and application state off to a database. For Friendster’s sake, I hope they have a huge clustered Oracle instance, because once they exceed the capabilities of their database, the site will fall apart.
JSP is far, far more efficient than PHP when it comes to taking load off the database, for the very reasons you mentioned above. Sandboxing every request is inherently a mistake, because to do any sort of OO would require that you load up the user’s profile every single time you hit a page.”
The typical PHP approach would be to store a user’s profile as part of the session data, which contains everything relevant to that session – you load this once per request and populate any necessary objects with it. Sure the DB call (if that’s what you’re using) is overhead but it’s manageable overhead. Further DB calls (e.g. fetching content) can be eliminated by smart use of caching.
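Roughly what that looks like, sketched in Python. All names here are hypothetical, the “database” is just a pair of counters, and `content_cache` stands in for a cache that lives outside the request (files on disk or memcached, say).

```python
# Count how often each kind of database call would happen.
db_calls = {"session": 0, "content": 0}

# Assumption: stands in for a cache that outlives a single request.
content_cache = {}


def load_session(session_id):
    # One lookup per request: the manageable overhead.
    db_calls["session"] += 1
    return {"profile": {"name": "alice"}}


def fetch_content(page):
    # Only hit the database on a cache miss.
    if page not in content_cache:
        db_calls["content"] += 1
        content_cache[page] = "<h1>" + page + "</h1>"
    return content_cache[page]


def handle_request(session_id, page):
    session = load_session(session_id)
    profile = session["profile"]  # populate objects from session data
    return fetch_content(page) + " for " + profile["name"]


for _ in range(3):
    body = handle_request("abc", "home")
```

After three requests there have been three session lookups but only one content query. The profile arrives with the session rather than through a separate query on every page.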
“If the average page in your Web application contains nine images, then only ten percent of the requests to your Web server actually used the persistent connections they have assigned to them. In other words, ninety percent of the requests are wasting a valuable (and expensive, from a scalability standpoint) Oracle connection handle. Your goal should be to ensure that only requests that require Oracle connectivity (or at least require dynamic content) are served off of your dynamic Web server. This will increase the amount of Oracle-related work done by each process, which in turn reduces the number of children required to generate dynamic content.
The easiest way to promote this is by off loading all of your images onto a separate Web server (or set of Web servers).”
This tip is specific to an application and the environment it’s running in. I’d argue that PHP developers think about applications this way; each one is unique and, when it comes to scaling, a unique set of solutions is required.
Meanwhile the Java approach is looking for a “one size fits all” solution – something that will take you away from specific solutions like this and give you an environment you can “fire and forget” in. While that’s an admirable goal, it has meant reinventing a lot of wheels and requires the fundamental belief that software can be mass produced and still meet requirements. For the J2EE guy thinking that way, the PHP approach seems out of mental range.
Ultimately I think this is a showdown between “Process / Fork” (LAxP) vs. “Runtime / Thread” (J2EE / .NET). When asking “does PHP scale?” you’re really asking “does Process / Fork + X persistent store scale?”. In many ways that’s questioning whether *Nix scales…
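The arithmetic behind the quoted tip, as a small Python sketch. The nine images per page comes from the quote; the figure of 200 concurrent requests is made up for illustration, and the calculation assumes every request holds a server child for about the same length of time.

```python
images_per_page = 9

# Each page view is one dynamic request plus one request per image.
requests_per_page_view = 1 + images_per_page

dynamic_share = 1 / requests_per_page_view
static_share = images_per_page / requests_per_page_view

# Hypothetical load: 200 requests in flight at once.
concurrent_requests = 200

# Everything on one server: every child holds a persistent connection.
connections_mixed = concurrent_requests

# Images moved to a separate server: only dynamic requests need a
# child with a database connection.
dynamic_children = concurrent_requests // requests_per_page_view
connections_saved = connections_mixed - dynamic_children
```

With these numbers, 20 children with database connections do the work that previously tied up 200, which is the saving the quote is pointing at.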