“The best laid schemes of mice and men,
Gang aft agley”
Last week here at SitePoint we were very proud to relaunch the brand new SitePoint.com.
However, those who were looking at the website the week before would have noticed we actually launched the week prior, only to rollback after some problems. Yes, despite rigorous testing of a perfectly functional staging and production deployment that had been in use for over a week, our best laid plans certainly went agley.
We’d like to share some lessons we learned or had reinforced through the experience, to help out those who might be relaunching existing websites.
1. Have a status page ready to go
No-one wants a launch to go badly. But sometimes it does and when things are going south, you want to be able to quickly flip to a status page that is a little more pleasant for your visitors to see instead of a horrible server error message.
The current SitePoint status page is hosted as a GitHub page which allows us to have an externally hosted page that shouldn’t be affected by any main site downtime.
Test that you can switch to your status page quickly.
2. Always be able to rollback
Rolling back to a previous version, whilst not desirable, should always be an option. As our new setup was being run on totally different infrastructure, we could quickly and safely rollback to our old site by changing a few DNS entries. If you need to run migrations over the existing data set, make sure you’ve taken a snapshot before you start your migrations.
After we decided we needed to take more time to address the filesystem errors we were trying to fix, being able to roll everything back and get some sleep was very important.
Never put yourself in a position where your ONLY option is to fix a broken setup.
3. Load test over multiple pages
Load testing is really important, and before we did our first launch, we used the excellent Loader.io to do some benchmarking against the current site, and the new setup. This allowed us to spot some caching inefficiencies and correct them, getting the new SitePoint to consistently hit DomContentLoaded in under two seconds, which is a threefold improvement over the old site!
Unfortunately, one area in which we failed was load testing over multiple pages. All of your visits are not going to hit the same page, so your load testing should reflect this as well. Visiting multiple pages is also going to put all parts of your technology stack under test. In our case, the part of our stack that fell over wasn’t in use on the homepage, so load testing here was never going to show the critical problem that was very quickly found out when we pushed the go live button.
4. Load test until your site fails
As developers, we take a certain pride in knowing that what we build can take all kinds of stresses and load, and hate to think of our application falling over – that’s only natural. But, do you know how much load your application can take before it starts to split at the seams? And which part of your application will feel it first?
While we were rebuilding the shared storage part of the technology stack, we hit our deployment with a huge amount of traffic, until it fell over. This allowed us to know how much traffic we could sustain (well over 10 times our regular traffic), what part of the stack fell over when under that pressure (the load balancers) and what we would have to do and how long it would take to get it working again (around 15 minutes).
This kind of insight allows us to forward plan where we need to make improvements in our deployment, and we’ve already started our plans to reduce the complexity of our technology stack.
5. Never deploy in the late afternoon, or on Friday
This might sound like the most obvious advice in the world, right? I mean, who launches on a Friday or just before you’re about to head home? Right? Right?
Unfortunately, almost all of us have made this mistake at least once in our career. We test things for days and days, are working like madmen to get them out the door before a deadline, and before you know it, its 4pm. Your boss says to you, “We ready to go?”, and you reply with the kind of optimism that really should have been blunted from many years of experience. “Sure, we’re ready to go!”
So you push the go live button, things creak and strain, and look to be working fine. Congratulations are distributed all round and everyone goes home. A few hours pass, and then, everything starts happening.
After testing for numerous days, we pushed the button around 4.30pm on Wednesday afternoon, Melbourne time. That’s ahead of most of the time zones our users are in, from a few hours ahead of South East Asia through to 17 hours ahead of San Francisco.
The first signs that something were up came around 7.30pm when people first started reporting slowdowns, and random disconnects. Then the disconnects become less random and more common, and before you knew it, the whole site was unresponsive. After some diagnosing, it was found that our shared storage solution running DRBD locked up, causing anything that accessed files on it to also lock up. Eventually this meant all Apache threads become locked up and no more requests were served.
We worked on this problem for a few hours, trying to unlock the filesystem, and by around midnight the website was up and running again–for about 10 minutes. One of the DRBD nodes had a kernel bug that prevented any further saving, and at around 2.30am the tough call was made to rollback to the old website.
After spending Thursday and Friday working on a different solution to WordPress’ shared storage conundrum, we had another potential opportunity to launch the website on Monday afternoon. However, not wanting to make the same mistake twice, the decision was made to launch first thing Tuesday morning. This proved to be a wise move, as inevitably there were small things that needed fixing up, and this was much easier to do with the whole day ahead, rather than after hours post launch.
6. Make sure your servers can be brought up quickly and painlessly
In this age of launching applications from cloud services such as AWS and RackSpace Cloud, it is vitally important that you can bring up new servers with an absolute minimum of effort. Generally this means you’ve either baked a prebuilt ISO/AMI, and/or you use some combination of Chef, Babushka, Puppet etc.
For our new deployment we decided to use Salt which allows us to fire up new app/proxy/search/database nodes in minutes, and have them ready to slide into the stack as painlessly as possible.
As we re-tested our deployment, we made sure we were able to destroy and bring up new instances while the system was under stress testing. Once the site was live, we wouldn’t be able to ask all visitors to stop looking at it for a designated time period!
7. Understand what will break when you remove certain parts of your system
One of the biggest failings of our first attempt at launch was not understanding the consequences of a lockup on our shared storage node. Whilst we mitigated this by replacing that part of the infrastructure completely, we then went to great lengths to test what would happen if other parts of the setup went missing.
Of course, if you remove the database server, everything is going to fall over pretty quickly! But what happened when Memcached was no longer around? Or the ElasticSearch server disappears? By removing these nodes we ensured some level of resilience. Without Memcached, performance drops dramatically but still survives, meaning we have a window to get a new server operational. Without ElasticSearch we fall back to default WordPress search which while not as quick or nice, still works.
This kind of testing lets you perform practical dev-ops tasks such as bringing up new app nodes and adjusting configuration requirements. A model to consider is the Chaos Monkey introduced by Netflix to test system resilience and breakdown response times by randomly disabling production instances.
8. Accept mistakes, learn from them, be transparent
It is an unfortunate part of life that not all eventualities can be accounted for, and no matter how much you plan, some things might go wrong. It’s vitality important that if this does happen, a team can band together and fix the problem quickly and efficiently without any finger pointing or blame laying.
SitePoint is fantastic in this regard, and as soon as issues started to present themselves, a ready and willing army of workers, including previous alumni, came and tirelessly helped debug and engineer a different plan of attack for the eventual re-relaunch.
Also important is the engagement that you have with your customers. We are lucky enough to have a loyal and understanding userbase, and the feedback through the downtime and restructuring was almost all positive, with fellow developers understanding the troubles that can sometimes happen during a big deploy. Having said that, we also never tried to hide behind the mistakes we made, and did everything to make sure the second time we launched a success.
While the main thrust of these lessons may seem basic – test everything, don’t deploy at danger times – it is easy to gloss over some of the most obvious things if you are confident with your setup. As developers, we are often times amazingly optimistic in what we believe is achievable, and this can flow on to our faith in our infrastructure setups, leading to ignoring or putting aside well known guidelines.