Not really; it's mostly the programmer's "fault". But anyway...
Originally Posted by Daijoubu
Maybe 10 or 20 years ago that was true, but right now I have a single database with 26 tables (due to file-size limitations more than anything else), each containing roughly 500,000 to 1 million records, serving content for a CMS. I can assure you the DB handles this quite well, without any issue. Putting that data in a separate system would make things slow, cumbersome, and difficult to manage.
>> (try that on a table with a million entries, then come and tell me it was a good idea).
Try telling any data guru that having a table with a million entries in your OLTP is a good idea.
You should never really have to worry about extremely large data sets in your OLTP, because you should never really have extremely large data sets in your OLTP. Your OLTP should really only contain what is necessary for the online system. Everything else should be offloaded, and maybe even de-normalized, to a data warehouse.
But you're really going about it the wrong way if you find yourself with millions of rows in your online system.
Everybody seems to talk about "enterprise programming" but then completely ignore aspects of "enterprise databasing" (I know that's not a real word, but you know what I mean).
Our user database is fast approaching a million users who have been active in the last 30 days as well.
The query I gave as an example was from a real-world situation. We want to show 10 random articles from the database (the 26 content tables are managed with a MERGE table). The only solution that gives good, truly random results was to do this:
1.) Count the number of records.
2.) Use mt_rand(0, $RecordCount);
3.) Grab the records using the randomly chosen IDs.
Like I said -- ORDER BY RAND() LIMIT 10 would never be more efficient (even if you were dealing with a small data set, such as a few thousand rows).
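A minimal PHP sketch of the three steps above, assuming a hypothetical articles table with a contiguous auto-increment id column and made-up connection details (a real table with deleted rows would need gap handling):

```php
<?php
// Hypothetical credentials and table name -- adjust for your own setup.
$db = new mysqli('localhost', 'user', 'pass', 'cms');

// 1.) Count the number of records.
$row   = $db->query('SELECT COUNT(*) FROM articles')->fetch_row();
$count = (int) $row[0];

// 2.) Use mt_rand() to pick 10 distinct random IDs.
// (Assumes IDs run contiguously from 1 to $count.)
$ids = array();
while (count($ids) < 10) {
    $ids[mt_rand(1, $count)] = true; // array keys de-duplicate automatically
}

// 3.) Grab the records using the randomly chosen IDs.
$in     = implode(',', array_keys($ids));
$result = $db->query("SELECT id, title FROM articles WHERE id IN ($in)");
while ($article = $result->fetch_assoc()) {
    echo $article['title'], "\n";
}
```

Fetching with a single WHERE id IN (...) keeps it at two queries total, whereas ORDER BY RAND() forces the server to assign a random value to every row and sort the entire table before it can apply the LIMIT.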
Of course, not every application is (or should be) designed the same. Some applications need the database as lean as possible; others perform better when all the data is accessible all the time. Blanket statements like "never" or "always" are the first signs of bad design.
Oh, and I'd never use any of the existing open source CMS systems for the sites that I run. Have you ever tried to move 20,000 articles from one section of your site to another in those things? Yikes!
Even a few simple mistakes can choke scalability, actually. The two most obvious examples are vBulletin and MediaWiki. Both are very popular, very well-regarded, and very poorly written pieces of software that scale extremely poorly. Yes, you can just throw more web servers at the problem (as most people do), but writing a more efficient code base from the ground up would have saved thousands upon thousands (or, in MediaWiki's case, millions) of dollars in hardware and maintenance costs.
Any PHP application is inherently scalable (unless it has been very, very poorly coded). It is not PHP's job to scale the system; it will naturally scale with your system.
Obviously there's a certain point where limitations beyond PHP's control start to hit you (once you get over 10,000 simultaneous connections or so, it's usually cheaper to just buy a second web server, since Apache can't really do much better than that on any reasonably priced current hardware; I use a dual Opteron with 4GB of RAM as my benchmark of "reasonable").
However -- most people never see those kinds of numbers. I'm still baffled as to why it's more or less impossible to run vBulletin on a single machine once you have more than 200-300 concurrent connections (note: vBulletin's "current users online" is not a measure of concurrency; it's a count of users seen in the last 10 or 15 minutes, and there's a world of difference). Of course, most sites can fall back on things like Tux, which can easily serve 25k+ pages per second without batting an eye.
Rarely, and it depends on what you're actually doing.
Of course, there are also situations where it would be ridiculous to even think about running an application on only a handful of servers; it's clear you are going to need a whole server farm.
If it makes the difference between buying 100 servers and buying 110, it matters significantly to most every company out there. Servers are cheap, individually, but the numbers quickly add up, especially for the 95%+ of web companies out there who are not multi-billion dollar operations.
After all, it doesn't really matter much whether you are going to use 60 or 150 machines.
That doesn't help you one bit for replication of individual databases on most platforms.
An example: queries should follow the format db_name.table_name for replication of individual databases.
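To illustrate the db_name.table_name format above (the cms database, users table, and connection details here are all made up):

```php
<?php
// Hypothetical connection -- adjust for your own setup.
$db = new mysqli('localhost', 'user', 'pass');

// Ambiguous: whether this replicates depends on which default
// database the connection currently has selected.
$db->query("UPDATE users SET last_login = NOW() WHERE id = 42");

// Unambiguous: the db_name.table_name form names the target
// database explicitly, independent of the selected default.
$db->query("UPDATE cms.users SET last_login = NOW() WHERE id = 42");
```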
I generally say it's best to leave the scaling to the db server (mysql, postgresql, oracle, and microsoft sql server all do this naturally. I'm quite positive that the majority of other db platforms out there do as well, those are just the ones I've used personally). Attempting to implement a custom clustering system in your code is, at best, messy, and, at worst, dangerous.
And scaling databases is easy otherwise. A simple setup that picks one of several read-only databases at random:
class DB extends mysqli {
    private static $Connections = array(/* one host/user/pass array per replica */);

    public function __construct() {
        // mt_rand() is inclusive at both ends, so subtract 1 to stay in range.
        $ConnectionData = self::$Connections[mt_rand(0, count(self::$Connections) - 1)];
        parent::__construct($ConnectionData['host'], $ConnectionData['user'], $ConnectionData['pass']);
    }
}