Two Important Lessons from the AWS Failure

Last week Amazon Web Services experienced one of the largest disruptions the cloud computing service has seen. The failure caused a ripple effect across many websites such as Quora and HootSuite, which were completely down during the outage. Other websites, such as Reddit, were at least partially affected.

Failures happen, and as young as the cloud computing industry is, I am positive this will not be the last, or the biggest, outage.

But there were several issues with this incident (on both sides) that can teach us all something about planning for and handling crises.

How You Communicate in a Crisis is Critically Important

Amazon is facing its share of criticism for the AWS outage, as it should be. But the harshest criticism I’ve seen isn’t about the length of the outage, or the fact that there was one in the first place. The worst criticism I’ve read was about Amazon’s lack of any response for more than 40 minutes after the outage began—an eternity when your website is down.

As the outage continued, many customers were upset that the updates appeared as if they had been written by the legal department (and judging by the delay in getting updates, they may have been) instead of being written by real human beings who were working to resolve the problem.

The lesson here is clear—when you have any kind of crisis, communication with those affected is extremely important. In emergency mode, it may not be possible to pick up the phone to talk to a client or customer, but updating your website or changing the voicemail message can have a major impact.

When you communicate with your customers about a problem, be honest and sincere. It’s amazing how much a little sincerity can do to appease an upset customer. 37 Signals is a great example. When their Basecamp service has an outage (which is extremely rare), they respond with a detailed explanation and a compassionate apology, which we’ve yet to see from Amazon.

Have a Contingency Plan

As I was hearing of some extremely large websites being completely down due to the AWS outage, I couldn’t help wondering why they built their systems without any redundancy or backup plan. Cloud computing is a relatively young industry, and although Amazon Web Services has been very reliable, failures happen. You wouldn’t stop backing up your computer just because you’d never had a hard drive crash—it’s just bound to happen sooner or later.

One of the biggest advantages of cloud computing is its rapid scalability. It is entirely possible to set up two completely separate cloud environments, one at AWS and one at Rackspace for instance, and simply keep one as a backup, ready to be scaled up to production (either manually or automatically) when a failure occurs.
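To make the automatic option a little more concrete, here is a minimal sketch of the decision a failover script might make: check each environment in priority order and send traffic to the first one that responds. The environment names, the health-check URLs, and the helper names (Environment, http_health_check, pick_active) are all hypothetical and chosen for illustration; this is not how Amazon, Rackspace, or any of the affected sites actually handle failover, and a real setup would also need to switch DNS or a load balancer and keep data in sync.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence
from urllib.request import urlopen


@dataclass(frozen=True)
class Environment:
    name: str        # hypothetical label, e.g. "aws-primary"
    health_url: str  # hypothetical health-check endpoint


def http_health_check(env: Environment, timeout: float = 3.0) -> bool:
    """Return True if the environment's health URL answers with HTTP 200."""
    try:
        with urlopen(env.health_url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        # Any failure (timeout, DNS error, HTTP error) counts as unhealthy.
        return False


def pick_active(
    environments: Sequence[Environment],
    is_healthy: Callable[[Environment], bool] = http_health_check,
) -> Optional[Environment]:
    """Return the first healthy environment in priority order, or None."""
    for env in environments:
        if is_healthy(env):
            return env
    return None


# Assumed example setup: a primary at one provider, a standby at another.
ENVIRONMENTS = [
    Environment("aws-primary", "https://primary.example.com/health"),
    Environment("rackspace-standby", "https://standby.example.com/health"),
]
```

Calling pick_active(ENVIRONMENTS) from a scheduled job would return the primary while it is healthy and the standby once the primary stops answering. Passing the health check in as a parameter keeps the decision logic easy to test without touching the network.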

Now, what does this mean for you and me?

In my e-commerce business, we rely heavily on search engines to drive traffic (and revenues). In 2003, a Google update resulted in our website dropping from the top of page one to around page 50 for almost every major keyword phrase. Our traffic (and revenues) disappeared overnight. We scrambled to drive traffic through search marketing and other avenues, but it was too late by that point. It was our busy season and there were no customers.

Any time a single point of failure can cause a project or service to fail completely, you are just asking for trouble. I’ve seen it happen time and time again when companies hire a single contractor to program an application on a tight deadline, or rely on a single client for almost all of their income.

The solution in our e-commerce business was to diversify our marketing strategy. We still get a significant amount of traffic from search engines, but also drive revenues through search marketing, email marketing, social media and other forms of advertising. If another update causes a drop, we will definitely be affected, but it won’t be devastating.

Whether you have several contractors that can step in to help on projects in an emergency, or work to diversify your client roster so your biggest client doesn’t bring in the majority of your income, you should investigate ways you can limit the effect of any one incident on your business.

  • kaf

    I think someone at Sony should have read this article. Especially the part about “How You Communicate in a Crisis…” Great advice.

    • Brandon Eley

      Thanks!

  • Textfriend

    Basic crisis communications is still something companies have trouble with. These things need to be considered, planned and tested in advance. Think of it like a ‘backup for your brand’.

  • http://www.optimalworks.net/ Craig Buckler

    Amazon certainly need to improve their crisis communications.

    However, one of the main benefits of cloud hosting is built-in redundancy. Sites are not hosted on a single server; should any point fail, other systems will take responsibility. The outage highlighted a critical problem with the technology, but it’s difficult to accuse companies of not having a contingency plan when cloud hosting offers that very service.

    • http://www.onsman.com Ricky Onsman

      That’s assuming what is sold as “cloud computing” actually is that. Some providers seem to use the term rather loosely.

  • Happy Hosting

    The cloud hosting should be provided with SAN and several hardware nodes. Some companies fail to do this. I hope Amazon provides it correctly.

  • Dave Doolin

    This is the first huge-scale outage I’ve heard of Amazon having, so I’m not too worried that it took 40 minutes. 40 minutes really isn’t all that long in real time. Sure, it’s an eternity in internet time, but I suspect it took 20-30 minutes to figure out they had a real problem, and get the information to a decision maker.

    If it happens again… now that would be a problem!