Reliability in the Cloud: Asking the Right Questions

In 2011 the general public was introduced to the Cloud. Unfortunately, in many cases this introduction came as a result of the Cloud’s failures. Amazon’s April outage at their Northern Virginia datacenter was first among several Cloud outages to gather major news coverage. Popular Websites and Web applications like FourSquare, Reddit, and HootSuite simultaneously disappeared from the web as Amazon and its customers struggled to recover.

Critics of the Cloud were quick to point to this outage as evidence that the Cloud can’t be trusted for business-critical Web applications. Whether or not the critics are right, the outages certainly raise serious questions about Cloud reliability for the prudent CIO.

While Cloud services introduce vast new options around flexibility and scalability, Operations teams still need to maintain a high level of diligence in designing Cloud architectures. When services are outsourced to the Cloud, it becomes easy to think of basic reliability concerns as “somebody else’s problem.” But that couldn’t be further from the truth. The best way to approach working with the Cloud is to ask the same questions about reliability that you would ask a traditional provider.

Planning

Before designing any architecture, whether hosted or Cloud-based, gather input from your colleagues to determine the expectations for your infrastructure.

Which Web applications are mission critical and require 100% uptime?
Are there back-end applications that can be down for a few days in the event of a disaster?
What are the costs of downtime or data loss?

Any project plan that doesn’t start with these basics is destined to failure – and you might be surprised at how many organizations forget to plan!

Provider

Now that you know what you’re looking for, begin looking at providers. Start by casting a broad net – select at least five providers that provide services meeting your needs. For all but the simplest projects, make sure to start with a real conversation with a real human being. Discuss initial pricing at this stage so you have a better idea of the market. Having a few providers in the picture will help keep everyone honest.

Locations

The first step in determining location must be based on answers to the following questions:

Does your application require certain latency or performance guarantees that will be impacted by network placement?
Does the facility meet Tier 3 or Tier 4 standards as set by the Uptime Institute?

If all of your users or visitors are in New York, it probably doesn’t make sense to put your datacenter in Los Angeles. Applications requiring 100% uptime must be hosted at more than one location, and subsequent locations must be geographically diverse. Even the best-managed facilities will occasionally have an unplanned emergency.

If your application requires 100% uptime, you should take your analysis of locations one step further.

Are there any predictable events that could impact multiple locations?

For example, a single winter storm could impact both Chicago and New York. Disasters are good at finding your weak point. Plan ahead.

The good news is with the Cloud model the backup capacity is cheap. You may have only a few servers—or none at all—running in the backup datacenter, with the ability to spin up more instances if a disaster occurs. As you work with providers, ask for more information about how they recommend configuring your disaster recovery. You may even want to consider using a different provider for your primary and disaster recovery environments – reducing the risk that a change in business direction of one of your providers impacts your services.

Network

Now that you have a good idea of the providers and locations, dig deeper into the network connectivity of the facility.

Is the provider connected to multiple “Tier 1” Internet providers?
What steps does the provider take to ensure that there aren’t single points of failure in their network access?

Data and Monitoring

By now, you should have a good idea of the questions you need to ask to ensure your Cloud provider is reliable. But there’s one more step you might forget – and it goes back again to that all-important planning stage. The best datacenter redundancy plan will have absolutely no value if you don’t have a documented, regularly tested process for failovers.

Where is your data stored?
Is every bit of critical data still accessible if your primary datacenter goes down?
How long will it take to transition to the DR datacenter – and can you improve that time?

A prudent Web Operations team will test its DR process on a quarterly basis, preferably by performing a full failover to DR and back. At a minimum, your team should sit together and walk through the process, even if if’s not practical to do a live failover.

Finally, don’t forget monitoring! How will you know if a critical service is offline? If you don’t find out about an outage until you arrive at work on Monday morning, all of your disaster recovery plans will be compromised.

Are all critical systems monitored?
Do the people getting the monitoring alert have a documented way to engage the disaster recovery process and communicate the status?

During an unfolding disaster is never a good time to realize you don’t have the phone numbers for your database team. Make sure you communicate contact information and processes in advance to all critical personnel.

The Cloud is perfect for companies looking to deploy scalable and reliable Websites and Web applications, but it doesn’t change the basics of good planning. Will 2012 be the year your company suffers a major outage due to a lack of redundancy and a good plan? Or will it be the year that you can report to your customers that your operations remained unaffected while CNN is reporting massive outages?

Reliability Image via Shutterstock