8 Things We Learned from Relaunching SitePoint

This entry is part 2 of 4 in the series Redesigning SitePoint

“The best laid schemes of mice and men,
Gang aft agley”

Last week here at SitePoint we were very proud to relaunch the brand new SitePoint.com.

However, those who were watching the website the week before would have noticed that we actually launched then, only to roll back after some problems. Yes, despite rigorous testing of a perfectly functional staging and production deployment that had been in use for over a week, our best laid plans certainly went agley.

We’d like to share some lessons we learned or had reinforced through the experience, to help out those who might be relaunching existing websites.

1. Have a status page ready to go

No one wants a launch to go badly. But sometimes it does, and when things are going south you want to be able to flip quickly to a status page that is more pleasant for your visitors than a horrible server error message.

The current SitePoint status page is hosted as a GitHub page, which gives us an externally hosted page that shouldn’t be affected by any downtime on the main site.

Test that you can switch to your status page quickly.

2. Always be able to roll back

Rolling back to a previous version, whilst not desirable, should always be an option. As our new setup was running on totally different infrastructure, we could quickly and safely roll back to our old site by changing a few DNS entries. If you need to run migrations over the existing data set, make sure you’ve taken a snapshot before you start them.

After we decided we needed to take more time to address the filesystem errors we were trying to fix, being able to roll everything back and get some sleep was very important.

Never put yourself in a position where your ONLY option is to fix a broken setup.

3. Load test over multiple pages

Load testing is really important. Before our first launch, we used the excellent Loader.io to benchmark both the current site and the new setup. This allowed us to spot some caching inefficiencies and correct them, getting the new SitePoint to consistently hit DOMContentLoaded in under two seconds, which is a threefold improvement over the old site!

Unfortunately, one area in which we failed was load testing over multiple pages. Not all of your visits are going to hit the same page, so your load testing should reflect this as well. Visiting multiple pages also puts every part of your technology stack under test. In our case, the part of our stack that fell over wasn’t in use on the homepage, so load testing the homepage alone was never going to show the critical problem that surfaced very quickly once we pushed the go live button.
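
As a rough illustration of spreading load across several pages, here is a minimal Python sketch that builds a request plan weighted by each page's share of traffic. The paths and weights are hypothetical placeholders, not SitePoint's real URLs or traffic figures; in practice they would come from your own access logs or analytics, and the resulting plan would be fed to whatever load testing tool you use.

```python
import random
from collections import Counter

# Hypothetical traffic shares per path (assumed values for illustration only).
PAGE_WEIGHTS = {
    "/": 0.30,                                # homepage
    "/article/example-post/": 0.50,           # article pages
    "/?s=css": 0.15,                          # search results
    "/wp-content/uploads/example.png": 0.05,  # uploaded files
}


def build_request_plan(page_weights, total_requests, seed=None):
    """Return a list of paths to request, sampled in proportion to their weights."""
    rng = random.Random(seed)  # a seed makes the plan repeatable between runs
    paths = list(page_weights)
    weights = [page_weights[path] for path in paths]
    return rng.choices(paths, weights=weights, k=total_requests)


def summarize(plan):
    """Count how many requests in the plan go to each path."""
    return Counter(plan)


if __name__ == "__main__":
    plan = build_request_plan(PAGE_WEIGHTS, total_requests=1000, seed=1)
    for path, count in summarize(plan).most_common():
        print(f"{count:5d}  {path}")
```

The point of including search and uploaded files in the mix is that each one exercises a different part of the stack, which a homepage-only test would leave untouched.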

4. Load test until your site fails

As developers, we take a certain pride in knowing that what we build can take all kinds of stresses and load, and hate to think of our application falling over – that’s only natural. But, do you know how much load your application can take before it starts to split at the seams? And which part of your application will feel it first?

While we were rebuilding the shared storage part of the technology stack, we hit our deployment with a huge amount of traffic, until it fell over. This allowed us to know how much traffic we could sustain (well over 10 times our regular traffic), what part of the stack fell over when under that pressure (the load balancers) and what we would have to do and how long it would take to get it working again (around 15 minutes).
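
The idea of ramping up load until something breaks can be sketched in a few lines of Python. This is only an outline under stated assumptions: `send_batch` is a hypothetical callback standing in for a real load generator, `simulated_stack` is a made-up stand-in so the example runs on its own, and the rates and the 5% error threshold are arbitrary.

```python
def find_breaking_point(send_batch, start=50, step=50, max_rate=5000,
                        max_error_rate=0.05):
    """Raise the request rate step by step until the error rate passes the threshold.

    send_batch(rate) must return the fraction of failed requests (0.0 to 1.0)
    observed at that rate. Returns (last_healthy_rate, first_failing_rate);
    first_failing_rate is None if max_rate was reached without a failure.
    """
    last_ok = None
    rate = start
    while rate <= max_rate:
        if send_batch(rate) > max_error_rate:
            return last_ok, rate
        last_ok = rate
        rate += step
    return last_ok, None


def simulated_stack(rate, capacity=400):
    """Stand-in for a real deployment: errors grow once rate exceeds capacity."""
    if rate <= capacity:
        return 0.0
    return min(1.0, (rate - capacity) / capacity)


if __name__ == "__main__":
    healthy, failing = find_breaking_point(simulated_stack)
    print(f"healthy up to {healthy} req/s, failing at {failing} req/s")
```

Knowing both numbers matters: the last healthy rate tells you your headroom, and the first failing rate is where you go looking for the component that gave way.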

This kind of insight allows us to forward plan where we need to make improvements in our deployment, and we’ve already started our plans to reduce the complexity of our technology stack.

5. Never deploy in the late afternoon, or on Friday

This might sound like the most obvious advice in the world, right? I mean, who launches on a Friday or just before you’re about to head home? Right? Right?

Unfortunately, almost all of us have made this mistake at least once in our careers. We test things for days and days, work like madmen to get them out the door before a deadline, and before you know it, it’s 4pm. Your boss says to you, “We ready to go?”, and you reply with the kind of optimism that really should have been blunted by many years of experience. “Sure, we’re ready to go!”

So you push the go live button, things creak and strain, and look to be working fine. Congratulations are distributed all round and everyone goes home. A few hours pass, and then, everything starts happening.

After testing for numerous days, we pushed the button around 4.30pm on Wednesday afternoon, Melbourne time. That’s ahead of most of the time zones our users are in, from a few hours ahead of South East Asia through to 17 hours ahead of San Francisco.

The first signs that something was up came around 7.30pm, when people started reporting slowdowns and random disconnects. Then the disconnects became less random and more common, and before we knew it, the whole site was unresponsive. After some diagnosis, we found that our shared storage solution running DRBD had locked up, causing anything that accessed files on it to lock up as well. Eventually this meant all Apache threads became locked up and no more requests were served.

We worked on this problem for a few hours, trying to unlock the filesystem, and by around midnight the website was up and running again, for about 10 minutes. One of the DRBD nodes had a kernel bug that prevented any further saving, and at around 2.30am the tough call was made to roll back to the old website.

After spending Thursday and Friday working on a different solution to WordPress’ shared storage conundrum, we had another potential opportunity to launch the website on Monday afternoon. However, not wanting to make the same mistake twice, the decision was made to launch first thing Tuesday morning. This proved to be a wise move, as inevitably there were small things that needed fixing up, and this was much easier to do with the whole day ahead, rather than after hours post launch.
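
A rule like this is easy to bend under deadline pressure, so some teams encode it as a check in their deploy script. The following Python sketch shows one way that could look; the 2pm cutoff and the blocked days are assumptions chosen for illustration, not a policy described in this article, and the dates in the usage example are arbitrary.

```python
from datetime import datetime


def is_safe_deploy_time(now, latest_hour=14, blocked_weekdays=(4, 5, 6)):
    """Return True if `now` falls on an allowed weekday and before the cutoff hour.

    datetime.weekday() numbers Monday as 0, so 4, 5, 6 are Friday to Sunday.
    Both the cutoff hour and the blocked days are assumed defaults.
    """
    return now.weekday() not in blocked_weekdays and now.hour < latest_hour


if __name__ == "__main__":
    # Arbitrary example dates: a Wednesday late afternoon and a Tuesday morning.
    for moment in (datetime(2013, 11, 13, 16, 30), datetime(2013, 11, 19, 9, 0)):
        verdict = "go" if is_safe_deploy_time(moment) else "wait"
        print(moment.strftime("%A %H:%M"), "->", verdict)
```

A deploy script could call this with `datetime.now()` and refuse to continue unless someone explicitly overrides it.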

6. Make sure your servers can be brought up quickly and painlessly

In this age of launching applications from cloud services such as AWS and Rackspace Cloud, it is vitally important that you can bring up new servers with an absolute minimum of effort. Generally this means you’ve either baked a prebuilt ISO/AMI, and/or you use some combination of Chef, Babushka, Puppet etc.

For our new deployment we decided to use Salt, which allows us to fire up new app/proxy/search/database nodes in minutes, and have them ready to slide into the stack as painlessly as possible.

As we re-tested our deployment, we made sure we were able to destroy and bring up new instances while the system was under stress testing. Once the site was live, we wouldn’t be able to ask all visitors to stop looking at it for a designated time period!

7. Understand what will break when you remove certain parts of your system

One of the biggest failings of our first attempt at launch was not understanding the consequences of a lockup on our shared storage node. Whilst we mitigated this by replacing that part of the infrastructure completely, we then went to great lengths to test what would happen if other parts of the setup went missing.

Of course, if you remove the database server, everything is going to fall over pretty quickly! But what happens when Memcached is no longer around? Or when the ElasticSearch server disappears? By removing these nodes we confirmed some level of resilience. Without Memcached, performance drops dramatically but the site still survives, meaning we have a window to get a new server operational. Without ElasticSearch we fall back to the default WordPress search which, while not as quick or nice, still works.
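
The two fallbacks described above follow a common pattern: treat an unreachable optional service as a degraded mode rather than a fatal error. The Python sketch below illustrates that pattern in general terms. It is not the WordPress or SitePoint implementation; the function names and the callables passed in are hypothetical, and a `ConnectionError` is assumed to be how an unreachable service shows up.

```python
def search_with_fallback(query, primary, fallback):
    """Try the fast search backend; if it is unreachable, use the basic one.

    Returns (results, backend_name) so callers and tests can see which path ran.
    """
    try:
        return primary(query), "primary"
    except ConnectionError:
        return fallback(query), "fallback"


def get_with_cache(key, cache_get, load_from_db):
    """Read through a cache, treating an unreachable cache as a cache miss."""
    try:
        cached = cache_get(key)
    except ConnectionError:
        cached = None  # cache is down: slower, but the request still succeeds
    if cached is not None:
        return cached
    return load_from_db(key)


if __name__ == "__main__":
    def search_server_down(query):
        raise ConnectionError("search node unreachable")

    def basic_search(query):
        return [f"basic result for {query}"]

    print(search_with_fallback("css", search_server_down, basic_search))
```

Pulling a node out of a test environment and watching which branch runs is a cheap way to confirm that the degraded path really works before you need it.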

This kind of testing lets you perform practical dev-ops tasks such as bringing up new app nodes and adjusting configuration requirements. A model to consider is the Chaos Monkey introduced by Netflix to test system resilience and breakdown response times by randomly disabling production instances.

8. Accept mistakes, learn from them, be transparent

It is an unfortunate part of life that not all eventualities can be accounted for, and no matter how much you plan, some things might go wrong. It’s vitally important that if this does happen, a team can band together and fix the problem quickly and efficiently without any finger pointing or blame laying.

SitePoint is fantastic in this regard, and as soon as issues started to present themselves, a ready and willing army of workers, including alumni, came and tirelessly helped debug and engineer a different plan of attack for the eventual re-relaunch.

Also important is the engagement that you have with your customers. We are lucky enough to have a loyal and understanding userbase, and the feedback through the downtime and restructuring was almost all positive, with fellow developers understanding the troubles that can sometimes happen during a big deploy. Having said that, we never tried to hide the mistakes we made, and did everything we could to make sure the second launch was a success.

While the main thrust of these lessons may seem basic – test everything, don’t deploy at danger times – it is easy to gloss over some of the most obvious things if you are confident with your setup. As developers, we are often times amazingly optimistic in what we believe is achievable, and this can flow on to our faith in our infrastructure setups, leading to ignoring or putting aside well known guidelines.


  • Kris

    I am not happy with your redesign. Same happen with MSDN. You guys have gone to 2023 not 2013. All thing look useless to me.

    Before now I can read better now I never got a better read with your current design. All thing you done look like Magic or Red-bull or bear you drink.

    Nothing but wasting of UI sense. Do you have read a joke-book or play a fancy game on phone before taking design of this layout.

    Look like a big crap for me.

  • Shayne Tilley

    That’s a positive outlook there Kris. I’d like to know what you mean by how the old design was easier to read. Smaller fonts more clutter must be your thing. I had zero to do with the redesign, but wow, way to be constructive.

  • Tim Igoe

    A very good article for anyone planning site works, even from a small scale level (where the infrastructure isn’t as great) the same things can still apply.

  • Anonymous

    Good article. Most of the advice is obvious, but it’s often the most obvious things that are neglected. Regarding the design itself, I think it’s a bold new direction and certainly looks to fall in line with the currently en vogue “mobile first” approach. I feel that readability has gone up dramatically.

    One little feature request would be a button somewhere to switch to light text on a dark background.

  • Anthony

    As we’ll shortly be doing something very similar, I found this article to be very useful. As mentioned by @Kevin, most is obvious, but it’s great to have a reminder and to be able to learn from others’ experiences.

    Personally I’d love to know a little more about your architecture, if possible. For example, I see that you use DRBD, which we were previously considering but didn’t implement in the end. It would be great for us to understand what you use, why, and maybe which alternatives were considered and why they were rejected.

    By the way, I love the redesign, but I always turn off the resize:none; that is applied to the comment box I’m typing in at the moment because it’s too small by default to be useful ;)

    • Michael Sauter

      @Anthony: We’re not using DRBD anymore. We were using it when we first attempted to launch. Then the DRBD instance locked up and brought our app servers down. We then decided not to use shared storage at all. For everything stored on there, we chose one of the following 3 options:
      1) Serve it via S3/CloudFront
      2) Store it on all app servers
      3) Remove it

      I was quite surprised that it actually wasn’t that difficult to do without a shared storage. And so far, we’re really happy with the new setup. Less points of failure, and less load on our infrastructure due to CloudFront.

      • Anthony

        Thanks Michael. We did much the same ourselves. S3/CloudFront make an excellent combination for serving static content

  • Anonymous

    Thanks for the sharing. I found [5. Never deploy in the late afternoon, or on Friday] the best advice.

  • Mary

    The old Site Point design was more contained and was easy to read and navigate. The very large font size on the new design was too large to read in one glance. I had to move my head from side to side to read large blaring headlines. Reducing the font size helped but then the next Web site I went to was much too small. There is a comfort level when it comes to font size in relation to the screen size. You probably want to be on top of all the latest design trends but most design trends have a way of missing something important, I have found, and eventually fall out of favor.

    All of us can learn something from constructive criticism, more than from all the approvals and “well dones.”

  • Anonymous

    I was happy with your Facebook attention, I gave some positive and negative feedback and felt like I had a good chat with “hawk” about it…. It’s refreshing when companies are responsive on their social profiles.

  • chutuoc

    i think the old design is better.

  • Ivan K

    Good redesign, things look more clear & simple. Keep going!

  • Deb

    In the process of a major site relaunch. We go live in 2 weeks, thanks for the info.. will take point 4 into consideration.

  • Amit

    >> Eventually this meant all Apache threads became locked
    >> up and no more requests were served.

    I’m curious to know why are you guys still using Apache for such a high traffic site. Wouldn’t it be better to use Nginx + PHP F-CGI for performance?

  • Charles

    I don’t like the New Relic ad constantly bugging me at the top – really annoying. I like the site being organized by language and topic but it does look very blocky. The site links at the bottom are weird, and why is that the only place where you can find a link to the Forums – one of THE most important place to go at Sitepoint.

  • Anonymous

    Overall a valuable list of ‘lessons’. Kudos to Site Point for having the humility to share the details.

    The summary says it all:
    “As developers, we are often times amazingly optimistic in what we believe is achievable, and this can flow on to our faith in our infrastructure setups, leading to ignoring or putting aside well known guidelines.”

    It seems that optimism is our greatest strength, and our greatest weakness. As developers we live by believing we can accomplish the ‘impossible’. The key is knowing when to set aside that optimism, expect and be prepared for the worst.

  • Sol

    Point 5 resonates with me…. so easy to “want” to go live without full soft launch testing.
    My number 9 would be: soft launch to a small audience first on the dev site.

  • lemon23

    I’m viewing this on Windows Phone 7 with IE9 mobile. A lot of the navigation and social icons appear as a grey rectangle. Or sometimes a rectangle inside a box of colour. It’s almost like a font is missing, but there’s no fallback for it

  • craig

    Aside from the technical aspects, there is something SERIOUSLY lacking in the structure of the new site. You need to fire some geeks and hire someone with a sense of style.

  • Anonymous

    Jude. Nice write-up here with some excellent points.

  • Nelson

    I feel that this redesign doesn’t make justice to Sitepoint. The menus are lacking and the section buttons are huge!
    It was easier to find information with the former design (where is the podcast archive?).

    Sitepoint for me was about the articles but now it feels just like many other tutorial sites out there that I don’t give a damn about.

    This design looks like it was made to be consumed on tablets but how many developers switch to a tablet just to browse the web when they spend most of their time on a notebook?

    Come on Sitepoint, you can do better than this!