How to Ditch Scheduled Maintenance

This article was sponsored by PagerDuty. Thank you for supporting the sponsors who make SitePoint possible.

Scheduled website maintenance is planned work on a service that usually requires a significant amount of time and manpower, as well as downtime. It’s the internet equivalent of sending your car away for a tune-up. The problem is, web apps are definitely not cars.

Let’s face it: no one loves scheduled maintenance, neither the customers nor the developers. It’s time-consuming, usually falls on a weekend, and can cost the company money. Even if you perform your maintenance on a Sunday night in the United States, that’s Monday morning in India, so those users are still affected by your outage.

Webmasters “schedule” such maintenance to make major changes to their websites. They accept the downtime, knowing its effects, only to avoid the interference they would face if the service were still running. Imagine a car mechanic trying to fix an engine while it’s still running.

Fortunately, modern development methods mean things have changed. This post will talk about ways to ditch scheduled maintenance and find a way to fix bugs in real time, without affecting a single user! Initially daunting, it is very achievable with proper planning and execution strategies.

Continuous Deployment — A World Without Scheduled Maintenance

Let’s start at the beginning and talk about the goal of scheduled maintenance. This is usually the time when you roll out major changes to your product — be it solving bugs or adding features. Usually, you’ll perform a bunch of changes during one session of maintenance to make the most of the downtime.

The other, better way to make these changes is simple: in real time. This is called continuous deployment. As soon as you identify a bug, you assign it to a developer. Once the fix is ready, you merge it into your main code base. It sounds pretty simple, but there are a few more steps involved.

Maintain a Staging Server

Many organizations maintain a staging server with a dummy database — one that acts as a link between your local machine and the main server. The primary motive of the staging server is to test how the changes would look to an end user. You’ll first preview the changes on the staging server and if you are satisfied with the changes, you’ll make them appear on the main server.

Access to the staging server must be restricted. Hosting it on an uncommon sub-domain (something other than “staging.yoursite.com”) makes it harder to stumble upon, but obscurity alone does not restrict access, so protect it with authentication as well.

Automated Testing

You should have a proper test suite in place (including unit tests and integration tests) that every fix must pass before it is merged with the main code base. This is important because the fix itself could introduce a new bug, which could lead to downtime. In many organizations, this testing is automated: your code reaches the main server only when it has passed all the specified tests.

The idea here is to ensure that any bit of code that you add doesn’t end up breaking something.
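As a minimal sketch of what such a test looks like, here is a unit test written with Python’s built-in unittest module. The `slugify` function is a hypothetical piece of application code invented for this example; your own tests would target your own functions.

```python
import re
import unittest


def slugify(title):
    """Hypothetical application code: turn a post title into a URL slug."""
    # Replace every run of non-alphanumeric characters with a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


class SlugifyTest(unittest.TestCase):
    def test_replaces_punctuation_and_spaces(self):
        self.assertEqual(slugify("Hello, World!"), "hello-world")

    def test_empty_title_gives_empty_slug(self):
        self.assertEqual(slugify(""), "")
```

Saved in a file whose name starts with `test`, this can be run with `python -m unittest`. A fix that breaks `slugify` would fail the suite and never reach the main code base.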

Continuous Integration

Where does this automated testing take place? We need a centralized server that runs the tests before the changes go live. Open source tools like Buildbot and Jenkins, and hosted services like Travis CI, automate this task.

I’m sure you use some kind of version control to manage your source code. In the popular version control system Git, you can write scripts called “hooks” to perform a range of tasks as per your requirements. The most common task would be to merge the code with the main repository once the tests have passed. Here is a detailed tutorial on the use of Git hooks for continuous integration.
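To make the idea of a hook concrete, here is a rough sketch of one written in Python (a hook can be any executable file). It assumes a client-side `pre-push` hook, which Git aborts when the script exits with a non-zero status. The `CHECKS` list and the `tests` directory are assumptions for illustration, not part of any particular project.

```python
#!/usr/bin/env python3
"""Sketch of a Git pre-push hook: block the push if any check fails."""
import subprocess
import sys

# Hypothetical checks; replace with your project's own test commands.
CHECKS = [
    [sys.executable, "-m", "unittest", "discover", "-s", "tests"],
]


def run_checks(checks):
    """Run each command in order; return the first non-zero exit code, or 0."""
    for command in checks:
        completed = subprocess.run(command)
        if completed.returncode != 0:
            print("Check failed:", " ".join(command), file=sys.stderr)
            return completed.returncode
    return 0


# In the real hook file, finish with:
#     sys.exit(run_checks(CHECKS))
```

To try it, save the script as `.git/hooks/pre-push`, add the final `sys.exit(...)` line, and make the file executable. A CI server applies the same gate centrally, so it does not depend on each developer’s local setup.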

Manage Changes to Database Schema Without Downtime

One problem that can really make scheduled maintenance tempting is the release of patches or features that require changes to your database schema. How do you proceed with such a release without any downtime?

It turns out there is a way around that too, although it does require some care. You’ll need to follow roughly these steps:

Modify Schema by creating a new table, or adding a new column (without deleting the old table or column)
Modify your application to read from the old data and write to both the old and new data
Migrate your data by copying the old data to the new schema
Modify your application to read and write only the new schema

You should proceed with caution at every step. Here is a more detailed tutorial.
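The four steps above can be sketched with Python’s built-in sqlite3 module. This is only an illustration under simple assumptions: the `users` table, the rename of `name` to `full_name`, and the helper functions are all hypothetical, and a production backfill would usually run in small batches rather than as a single UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada'), ('Linus')")

# Step 1: modify the schema by adding the new column; the old one stays.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


# Step 2: the application reads the old column and writes to both.
def create_user_dual_write(conn, name):
    conn.execute(
        "INSERT INTO users (name, full_name) VALUES (?, ?)", (name, name)
    )


def read_name_old(conn, user_id):
    row = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,))
    return row.fetchone()[0]


create_user_dual_write(conn, "Grace")

# Step 3: migrate rows that were written before dual writes began.
conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")
conn.commit()


# Step 4: the application reads and writes only the new column.
def create_user_new(conn, name):
    conn.execute("INSERT INTO users (full_name) VALUES (?)", (name,))


def read_name_new(conn, user_id):
    row = conn.execute("SELECT full_name FROM users WHERE id = ?", (user_id,))
    return row.fetchone()[0]


create_user_new(conn, "Margaret")
```

Each step is deployed on its own, and the old column is dropped only after nothing reads or writes it any more. At every point, both the previous and the next version of the application can work against the current schema, which is what removes the need for downtime.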

Real Time Alerts and Situational Awareness

Even if your deployment system is perfect, bugs can still creep in. These bugs are often first encountered by end users. Good Samaritans will report them, but you may not hear about a bug immediately.

It is therefore important to set up an alert system that lets you know as soon as an end user encounters an error. This could be done by sending an email to your developers’ mailing list, or by notifying whoever is on call through ChatOps.
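One lightweight way to wire this up, sketched here with Python’s standard logging module, is a handler that forwards every error-level log record to a notification function. The `AlertHandler` class and the `notify` callback are names made up for this example. In practice the callback might send an email (the standard library also ships `logging.handlers.SMTPHandler` for that) or post to a chat room.

```python
import logging


class AlertHandler(logging.Handler):
    """Forward ERROR and CRITICAL log records to a notification callback."""

    def __init__(self, notify):
        super().__init__(level=logging.ERROR)
        self.notify = notify  # any callable that accepts the message text

    def emit(self, record):
        try:
            self.notify(self.format(record))
        except Exception:
            # Never let a failing alert channel crash the application.
            self.handleError(record)


# For the sketch, "sending" an alert just appends it to a list.
sent_alerts = []

logger = logging.getLogger("webapp")
logger.addHandler(AlertHandler(sent_alerts.append))

logger.warning("cache miss rate is high")           # below ERROR: no alert
logger.error("checkout failed for order %s", 1234)  # triggers an alert
```

After this runs, `sent_alerts` holds only the checkout error. The warning is filtered out by the handler’s level, so the people on call are not flooded with noise.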

It’s enormously helpful to have a system in place for getting a single view of your infrastructure, so you can keep up with and respond to bugs and downtime. PagerDuty is a platform that lets your on-call devs keep track of issues, with detailed analytics and integrations with apps like New Relic, Crashlytics and AppDynamics. During an incident, alerts will come through email, phone call, push notification, or through integrations with these other services. The service has continuous routing and automatic escalation to make sure every alert is given the attention it warrants.

Roll Out Features in Stages

We have been talking mostly about bugs so far. When it comes to releasing new features, companies like Facebook roll them out in batches of users (as with the new Timeline or Graph Search). This helps them fix bugs in early versions of a new feature without affecting the service too much.

In addition to fixing bugs, as the traffic to a new feature increases you can assess its performance and make the required changes for it to perform optimally.
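A common way to implement this kind of staged rollout is to hash each user into a stable bucket and enable the feature for a growing percentage of buckets. The sketch below is one possible approach, not a description of how any particular company does it. The function name and the feature name are invented for illustration.

```python
import hashlib


def in_rollout(feature, user_id, percent):
    """Return True if this user falls inside the first `percent` buckets."""
    # Hashing feature and user together gives each feature its own,
    # independent assignment of users to the 100 buckets.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Hypothetical usage: show the new page to 10% of users.
if in_rollout("new-timeline", "user-42", 10):
    pass  # render the new feature
else:
    pass  # render the existing one
```

Because the bucket depends only on the feature and user ID, a user who sees the feature at 10% keeps seeing it when you raise the figure to 50%. Setting the percentage back to 0 switches the feature off for everyone without a deployment.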

Use High Availability Architecture

Your users are (hopefully) spread across the globe, and your application gets traffic throughout the day. This makes uptime crucial. Developers aim for as much as 99.9999999% uptime, which works out to about 0.03 seconds of downtime in a year.
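The downtime budget behind an uptime figure is simple arithmetic, and it is worth working out for your own target. This small sketch assumes a 365-day year. The function name is made up for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def downtime_seconds_per_year(uptime_percent):
    """Seconds of downtime per year allowed by a given uptime percentage."""
    return SECONDS_PER_YEAR * (1 - uptime_percent / 100)


for target in (99.9, 99.99, 99.9999999):
    print(f"{target}% uptime allows {downtime_seconds_per_year(target):.3f} s/year")
```

For comparison, 99.9% uptime still leaves room for about 8.76 hours of downtime a year and 99.99% for about 53 minutes, so each extra nine shrinks the budget by a factor of ten.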

With such requirements, high-availability approaches like load balancing and failover systems are needed. In addition to applying these principles on the application layer, you may also need to perform replication or sharding on your database as per your needs.
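To show how load balancing and failover fit together, here is a toy round-robin balancer that skips servers failing a health check. Real deployments use dedicated load balancers. The class, the backend names and the health check here are all assumptions made for the sketch.

```python
import itertools


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are unhealthy."""

    def __init__(self, backends, is_healthy):
        self.backends = list(backends)
        self.is_healthy = is_healthy  # callable: backend -> bool
        self._cycle = itertools.cycle(self.backends)

    def next_backend(self):
        # Try each backend at most once per request before giving up.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if self.is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy backends available")


# Hypothetical servers; "app-2" is currently down.
down = {"app-2"}
balancer = RoundRobinBalancer(
    ["app-1", "app-2", "app-3"], lambda backend: backend not in down
)
picks = [balancer.next_backend() for _ in range(4)]
```

Here `picks` alternates between `app-1` and `app-3`, so traffic keeps flowing while `app-2` is being repaired or upgraded. That is also how you can take one server at a time out of rotation to update it without an outage.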

Final Thoughts

Before we finish off this post, let me give you a few examples of companies that not only perform continuous deployment, but also encourage others to do so: Facebook, Google, LinkedIn and PagerDuty (here’s a post about how the last of these manages it). It just shows that continuous deployment is the way ahead.

How did you make the move away from scheduled maintenance? Do you have any tips for making the transition easier?