Horizontal Scaling of PHP Apps, Part 2

Bruno Skvorc

In part 1, we discussed the horizontal scaling of the application layer – scaling out servers to handle the PHP code concurrently via load balancers and other means. We covered sharing of local data, touched on potential optimization, and explained the basics of load balancers.

The application layer, however, isn’t the only thing that needs scaling. With a bigger demand for our app, the demand for higher read/write operations on the database also surfaces. In this part, we’ll look at scaling the database, explain replication, and cover some common pitfalls you might want to avoid.


As in Part 1, we need to mention optimization first. Indexing your database properly, keeping each table down to its essential data and moving secondary information into separate tables (users_basic + users_additional + users_alt_contacts, etc. – a form of vertical partitioning; full database sharding is a complex topic warranting its own article), and running small, atomic queries instead of large on-the-fly calculations – all of these methods can help you speed up your databases and avoid bottlenecks. There's one aspect that can help even more, though: the query cache.

Database servers typically cache the results of recently executed SELECT queries. This allows a client (our app) to receive the data from the cache instead of waiting for the query to execute again. Such an approach saves CPU cycles, produces results faster, and frees the servers from doing unnecessary work. But the query cache is limited in size, and some data sets change more often than others – caching all of our data would be wasteful, especially the parts that change rapidly. While the cache can be fine-tuned, depending on the database vendor, there's an approach that lends itself quite nicely to the query cache: contextual grouping of servers.

When you group servers, each group is dedicated to a specific part of your app. Say I'm making an online gambling website: I could have one group for in-game chats, one for ongoing games, and one for user accounts. The user accounts group doesn't take that many reads or writes, at least not as many as the chats or games do, so we can keep it small – say, 5 servers. In the chats group, we can have 10 servers, because the real-time communication of millions of users is very important for a high quality gambling experience. But no aspect is as important as the actual games – our bread and butter – so those get an exceptionally large group, maybe 20 servers.

This makes sure you do all your user-related reads and writes in one group, without touching the others. The games need to be fast, so the largest group is there to make sure they're served quickly (both reads and writes), while the user account page isn't all that important to people who are already logged in and preparing to change things – we've already captured their attention and don't need to impress them with super-fast responsiveness on the non-business-critical parts of our app.

All groups share the same total data (see MSR below), so you can move servers from one group to another. Say there's a huge poker championship and millions of users are watching the game: the games group could benefit from 4–5 servers taken from the chat group to make sure it can handle the load. That said, there are built-in solutions for clustered databases, and one such solution is master-slave replication.
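As a sketch, such contextual grouping can be expressed in an app's configuration as plain pools of database hosts. Everything below – host names, group names, and the counts – is illustrative, not part of any particular library:

```php
// Hypothetical config: one pool of DB servers per functional group
$dbGroups = [
    'chat'     => ['chat-db-01', 'chat-db-02', 'chat-db-03', 'chat-db-04'],
    'games'    => ['games-db-01', 'games-db-02'],
    'accounts' => ['acct-db-01'],
];

// During the poker championship, borrow capacity from chat for games:
// array_splice() removes (and returns) the first two chat servers
$borrowed = array_splice($dbGroups['chat'], 0, 2);
$dbGroups['games'] = array_merge($dbGroups['games'], $borrowed);

// games now has 4 servers, chat is down to 2
```

Because the pools live in configuration, rebalancing groups is a config change rather than a code change.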

Master-Slave Replication

Master-Slave replication (MSR) is a very common feature in modern databases, often built-in. Search the documentation of your particular database vendor for information on whether or not it’s natively supported. MSR is the process of sending all database write operations from one server (the master) to one or more slaves. The slaves then replay the queries, and thus replicate the master’s data. Integrated with a web app, this is how it works, step by step:

  • A visitor performs a write operation on the database – for example, changing some profile information.
  • The application sends the query to the master database server, as usual.
  • The master executes the query and forwards it to all the slaves.
  • The slaves execute the query, and the same data now exists on both the master and the slaves.
  • Further read operations are performed on the slaves.

The final step is what's important here – we perform read operations on the slaves. This in itself is a division of labor between machines: with the master free to handle only writes (generally, there are far fewer writes than reads in a web app, so one master is often enough) and an army of slaves available for fetching, no single server is overburdened.

Replication support is built into today's installations of MariaDB and MySQL, though the master and slaves still need to be configured and pointed at each other.
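In broad strokes, and only as a sketch (the option names below are the standard MySQL/MariaDB ones, but check your vendor's docs for your version): the master needs binary logging and a unique server ID, and every slave needs a distinct server ID of its own:

```ini
# master my.cnf (sketch)
[mysqld]
server-id = 1
log_bin   = mysql-bin

# each slave's my.cnf (sketch) - every server needs a distinct id
[mysqld]
server-id = 2
```

On the master you would also create a replication account (GRANT REPLICATION SLAVE ON *.* TO …) and point each slave at it with CHANGE MASTER TO followed by START SLAVE; the exact procedure is in your vendor's replication documentation.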

Separating reads and writes

This is where a slight modification to your architecture is required, and its complexity is inversely proportional to the quality of your code and your use of a framework or a good project structure. To separate reads and writes, you'll need separate connections: for writing, you connect to a "write DB" (the master), and for reading, to one of the slaves. This can, of course, be done manually. For writing, you would connect to a specifically defined master configuration:

$config = $this->getService('configProvider');

// Config keys are illustrative – adapt them to your own config layout
$master = new PDO(
    $config->db->master->dsn,
    $config->db->master->user,
    $config->db->master->pass
);
$statement = $master->prepare('SOME QUERY FOR UPDATING');

While for reading, a new connection would be required, through a different set of config values:

$config = $this->getService('configProvider');

// e.g. the first slave in the list; config keys are illustrative
$slaveConfig = $config->db->slaves[0];

$slave = new PDO(
    $slaveConfig->dsn,
    $slaveConfig->user,
    $slaveConfig->pass
);
$results = $slave->query('SOME QUERY FOR READING');

If we have multiple slaves, and we know their host addresses, we can do some randomization:

$config = $this->getService('configProvider');

// Pick a random slave from the list of possible slaves
$slaveConfig = $config->db->slaves[array_rand($config->db->slaves)];

$slave = new PDO(
    $slaveConfig->dsn,
    $slaveConfig->user,
    $slaveConfig->pass
);
$results = $slave->query('SOME QUERY FOR READING');

It doesn't take long to realize this is awkward to do every time you need to read or write. A better solution is to pre-initialize a master DB connection in a service container of some sort, and then just access that one for writes, without re-establishing the connection each time. As for slaves, the same approach applies: build the randomization into a "slave" DB adapter, and just call something like $this->getService('db')->slave, which automatically returns a randomly selected slave. This way, you never have to worry about selecting one manually again, and as you add more slaves to your cluster, you just add their host addresses to the configuration file.
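A minimal sketch of such a service could look like the class below. It assumes a config object shaped like the earlier examples ($config->db->master, $config->db->slaves); the class and property names are made up for illustration:

```php
class DbService
{
    private $config;
    private $master;

    public function __construct($config)
    {
        $this->config = $config;
    }

    // One master connection, created lazily and reused for all writes
    public function master()
    {
        if ($this->master === null) {
            $c = $this->config->db->master;
            $this->master = new PDO($c->dsn, $c->user, $c->pass);
        }
        return $this->master;
    }

    // Reads go to a randomly selected slave; callers never pick one by hand
    public function slave()
    {
        $c = $this->pickSlaveConfig($this->config->db->slaves);
        return new PDO($c->dsn, $c->user, $c->pass);
    }

    // Separated out so the selection logic is easy to test on its own
    public function pickSlaveConfig(array $slaves)
    {
        return $slaves[array_rand($slaves)];
    }
}
```

Registered once in the container, this gives the rest of the app the `$this->getService('db')->slave` style of access described above.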

Want to take it up another notch? Have the slave-fetching service select only those slaves that aren’t taxed or down at the moment – make the slaves report their CPU/RAM usage somewhere repeatedly, and make sure the slave fetching class always selects the one that’s under the least amount of fire. This abstract selector class should also make sure it connects to another slave if the one it originally tried to connect to is down – you need the database reads to remain transparent to the end user, but you still want the admins to know there’s a problem with one of the slaves:


// ... some class, some connect() method

$config = $this->getService('configProvider');
$mailer = $this->getService('mailer');
$validSlaves = $config->db->slaves;
$bConnectionSuccessful = false;

while (!empty($validSlaves) && !$bConnectionSuccessful) {
    $randomSlaveKey = array_rand($validSlaves);
    $randomSlave = $validSlaves[$randomSlaveKey];
    try {
        $slave = new PDO(
            $randomSlave->dsn,
            $randomSlave->user,
            $randomSlave->pass
        );
        $bConnectionSuccessful = true;
    } catch (\PDOException $e) {
        $mailer->reportError( ... ); // Some kind of email sent to the admins, or the master log, etc.
    } catch (\Exception $e) {
        $mailer->reportError( ... ); // Some kind of different email sent to the admins, or the master log, etc.
    }
    if (!$bConnectionSuccessful) {
        // Never retry a slave that just failed – drop it from the pool
        unset($validSlaves[$randomSlaveKey]);
    }
}

if ($bConnectionSuccessful) {
    return $slave;
}

$mailer->reportError( ... ); // Report that the entire attempt to connect to any slave failed
throw new \Exception( ... ); // or report via an exception with details in the message

This is only pseudocode, but you get the gist – try to connect to a randomly selected slave, if it fails, remove that slave from the array of valid slaves, report the problem, and try again on another slave for as long as there are more to try.

Read/write sync delay

One thing that might pose a data consistency threat is the sync delay between masters and slaves. On smaller writes, the replication to slaves will be near instant, and bigger queries will, naturally, need longer to execute – this delay also increases as the slave army grows, because the master has more query copies to disperse – some slaves will be ready sooner than others. However, no matter the quality of your cluster’s network, the speed of the individual machines, or the size of the queries, there is no setup in the world able to accurately do the following:


$db = $this->getService('db');
$master = $db->get('master');
$slave = $db->get('slave');

// value is currently 1
$master->exec('UPDATE `some_table` SET `value` = `value` + 1 WHERE `id` = 1'); // executes
$slave->query('SELECT `value` FROM `some_table` WHERE `id` = 1'); // value is still 1

These statements are executed too close to each other, and there is simply no way for the slave to get the update in time to accurately reflect it.

There are various workarounds for this, though none are foolproof – this inconsistency is something you just have to accept. In most cases, data that was just written doesn't need to be read back immediately. When it does, it's best to approximate: if you expect the value to have increased by 1, read the original value from the slave, add 1 to the result, and display that to the user, while the master propagates the actual update in the background.
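The approximation can be as simple as compensating locally for the write we know is in flight. A sketch, with a helper name made up for illustration:

```php
// After sending e.g. "value = value + 1" to the master, a slave read
// may still return the old value. Compensate locally instead of
// waiting for replication to catch up.
function approximateAfterWrite(int $staleValue, int $expectedDelta): int
{
    return $staleValue + $expectedDelta;
}

// The slave still says 1, but we know the master is adding 1:
$displayValue = approximateAfterWrite(1, 1); // show 2 to the user
```

This only suits non-critical display values; the authoritative number still comes from the master once replication catches up.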

Master fail

What if a master fails, though? Does the whole system grind to a halt? Well, it shouldn’t. Solutions exist for master failovers, and they usually solve the problem by turning a slave into a master if the master fails. Additional architecture changes are needed, however, to make this possible. The procedure is as follows:

  • A master fails and can no longer accept writes
  • Our master fetching script recognizes this, reports an error, and selects a random available slave
  • This slave is given the signal to become the master
  • All other slaves are given the signal to switch to the new master
  • A small data loss is possible, if the master died mid-write and didn’t have time to propagate the data change to all slaves, or if the slave didn’t have time to execute the query before being turned into a master
  • When the failed master is brought back online, it should be wiped and set up from scratch as a slave to the new master. Trying anything else is pointless: catching up on the queries it missed during the downtime would be a fool's errand. If the failed master is your most powerful machine and you really want it to remain the master, additional safety steps are needed, which will probably mean a short downtime while the rebooted master catches up on everything.

In MySQL and MariaDB in particular, you can make a slave switch its master with the CHANGE MASTER TO command, while promoting from slave to master is done via STOP SLAVE and RESET MASTER. For more information on master switching, please see the docs.
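As a sketch of that promotion sequence (the host name and credentials below are illustrative):

```sql
-- On the slave being promoted: stop replicating and reset its
-- binary log state so it can serve as the new master
STOP SLAVE;
RESET MASTER;

-- On every remaining slave: point it at the newly promoted master
STOP SLAVE;
CHANGE MASTER TO
    MASTER_HOST = 'new-master.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'secret';
START SLAVE;
```

In practice you'd also make the application's master configuration point at the new host, which is exactly what the master-fetching script from the list above is for.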


In Part 2, we covered database replication and clustering of databases. With both these parts now behind you, hopefully you’ve gathered enough initial knowledge to proceed on your own and build an excellent scalable environment. Do you have any suggestions you think are crucial in horizontal scaling? Would you like to see a more complex part 3? Have we missed some critical aspects? Let us know in the comments below!