Last week Amazon Web Services experienced one of the largest disruptions the cloud computing service has seen. The failure caused a ripple effect across many websites such as Quora and HootSuite, which were completely down during the outage. Other websites, such as Reddit, were at least partially affected.
Failures happen, and as young as the cloud computing industry is, I am positive this will not be the last, or the biggest, outage.
But there were several issues with this incident (on both sides) that can teach us all something about planning for and handling crises.
How You Communicate in a Crisis is Critically Important
Amazon is facing its share of criticism for the AWS outage, as it should. But the harshest criticism I’ve seen isn’t about the length of the outage, or the fact that there was one in the first place. The worst criticism I’ve read was about Amazon’s lack of any response for more than 40 minutes after the outage began, which is an eternity when your website is down.
As the outage continued, many customers were upset that the updates appeared as if they had been written by the legal department (and judging by the delay in getting updates, they may have been) instead of being written by real human beings who were working to resolve the problem.
The lesson here is clear—when you have any kind of crisis, communication with those affected is extremely important. In emergency mode, it may not be possible to pick up the phone to talk to a client or customer, but updating your website or changing the voicemail message can have a major impact.
When you communicate with your customers about a problem, be honest and sincere. It’s amazing how much a little sincerity can do to appease an upset customer. 37signals is a great example. When their Basecamp service has an outage (which is extremely rare), they respond with a detailed explanation and a compassionate apology, something we’ve yet to see from Amazon.
Have a Contingency Plan
As I was hearing of some extremely large websites being completely down due to the AWS outage, I couldn’t help wondering why they built their systems without any redundancy or backup plan. Cloud computing is a relatively young industry, and although Amazon Web Services has been very reliable, failures happen. You wouldn’t stop backing up your computer just because you’d never had a hard drive crash—it’s just bound to happen sooner or later.
One of the biggest advantages of cloud computing is its rapid scalability. It is entirely possible to set up two completely separate cloud environments, one at AWS and one at Rackspace, for instance, and simply keep one as a backup, ready to be scaled up to production (either manually or automatically) when a failure occurs.
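To make the automatic case a little more concrete, here is a minimal sketch in Python of the decision logic only. Everything in it is hypothetical and for illustration: the `Environment` and `FailoverRouter` names, the `is_healthy` probe, and the `failure_threshold` setting are my own assumptions, not part of any AWS or Rackspace API. In a real setup the health check would be something like an HTTP request to each environment, and switching would mean updating DNS or a load balancer and scaling up the standby.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    """One cloud environment plus a hypothetical health probe."""
    name: str
    is_healthy: Callable[[], bool]  # assumed: returns True when the site responds


class FailoverRouter:
    """Tracks which environment should be serving production traffic."""

    def __init__(self, primary: Environment, standby: Environment,
                 failure_threshold: int = 3) -> None:
        self.primary = primary
        self.standby = standby
        # Require several failed checks in a row so one blip does not
        # trigger a switch.
        self.failure_threshold = failure_threshold
        self.active = primary
        self._consecutive_failures = 0

    def check(self) -> Environment:
        """Run one health check and return the environment to use."""
        if self.active is self.primary:
            if self.primary.is_healthy():
                self._consecutive_failures = 0
            else:
                self._consecutive_failures += 1
                threshold_reached = (
                    self._consecutive_failures >= self.failure_threshold
                )
                # Only switch if the standby itself is responding.
                if threshold_reached and self.standby.is_healthy():
                    self.active = self.standby
        # Once on the standby we stay there; switching back is left as
        # a manual decision in this sketch.
        return self.active
```

You would call `check()` on a schedule (every 30 seconds, say) and point traffic at whichever environment it returns. I deliberately left out automatic failback: after an outage you probably want a person to confirm the primary is really stable before moving production back.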
Now, what does this mean for you and me?
In my e-commerce business, we rely heavily on search engines to drive traffic (and revenues). In 2003, a Google update resulted in our website dropping from the top of page one to around page 50 for almost every major keyword phrase. Our traffic (and revenues) disappeared overnight. We scrambled to drive traffic through search marketing and other avenues, but it was too late by that point. It was our busy season and there were no customers.
Anytime you have the possibility for a single point of failure to cause a project or service to fail completely, you are just asking for trouble. I’ve seen it happen time and time again when companies hire a single contractor to program an application on a tight deadline, or rely on a single client for almost all of their income.
The solution in our e-commerce business was to diversify our marketing strategy. We still get a significant amount of traffic from search engines, but we also drive revenues through search marketing, email marketing, social media and other forms of advertising. If another update causes a drop, we will definitely be affected, but it won’t be devastating.
Whether you line up several contractors who can step in to help on projects in an emergency, or work to diversify your client roster so your biggest client doesn’t bring in the majority of your income, you should investigate ways you can limit the effect of any one incident on your business.