Reliability in the Cloud: Asking the Right Questions

In 2011 the general public was introduced to the Cloud. Unfortunately, in many cases this introduction came as a result of the Cloud’s failures. Amazon’s April outage at its Northern Virginia datacenter was the first of several Cloud outages to garner major news coverage. Popular Websites and Web applications like FourSquare, Reddit, and HootSuite simultaneously disappeared from the web as Amazon and its customers struggled to recover.

Critics of the Cloud were quick to point to this outage as evidence that the Cloud can’t be trusted for business-critical Web applications. Whether or not the critics are right, the outages certainly raise serious questions about Cloud reliability for the prudent CIO.

While Cloud services introduce vast new options around flexibility and scalability, Operations teams still need to maintain a high level of diligence in designing Cloud architectures. When services are outsourced to the Cloud, it becomes easy to think of basic reliability concerns as “somebody else’s problem.” But that couldn’t be further from the truth. The best way to approach working with the Cloud is to ask the same questions about reliability that you would ask a traditional provider.

Planning

Before designing any architecture, whether hosted or Cloud-based, gather input from your colleagues to determine the expectations for your infrastructure.

  • Which Web applications are mission critical and require 100% uptime?
  • Are there back-end applications that can be down for a few days in the event of a disaster?
  • What are the costs of downtime or data loss?

Any project plan that doesn’t start with these basics is destined to fail – and you might be surprised at how many organizations forget to plan!

Provider

Now that you know what you’re looking for, begin looking at providers. Start by casting a broad net – identify at least five providers whose services meet your needs. For all but the simplest projects, make sure to start with a real conversation with a real human being. Discuss initial pricing at this stage so you have a better idea of the market. Having a few providers in the picture will help keep everyone honest.

Locations

The first step in choosing a location is to answer the following questions:

  • Does your application require certain latency or performance guarantees that will be impacted by network placement?
  • Does the facility meet Tier 3 or Tier 4 standards as set by the Uptime Institute?

If all of your users or visitors are in New York, it probably doesn’t make sense to put your datacenter in Los Angeles. Applications requiring 100% uptime must be hosted at more than one location, and those locations must be geographically diverse. Even the best-managed facilities will occasionally have an unplanned emergency.
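
If the latency question is hard to answer on paper, a quick measurement settles it. The sketch below is a minimal Python illustration, with hypothetical hostnames standing in for whatever endpoints your candidate providers give you; run it from the locations where your users actually are.

```python
# Rough latency check (a sketch, not a benchmark): time TCP connections
# from a client location to each candidate datacenter. The hostnames below
# are hypothetical placeholders for endpoints your providers supply.
import socket
import time

CANDIDATE_ENDPOINTS = {
    "Northern Virginia": "dc-east.example.com",
    "Los Angeles": "dc-west.example.com",
}

def connect_latency_ms(host, port=443, attempts=5):
    """Average TCP connect time to host:port in milliseconds."""
    samples = []
    for _ in range(attempts):
        start = time.time()
        sock = socket.create_connection((host, port), timeout=5)
        sock.close()
        samples.append((time.time() - start) * 1000)
    return sum(samples) / len(samples)

if __name__ == "__main__":
    for location, host in CANDIDATE_ENDPOINTS.items():
        print(f"{location}: {connect_latency_ms(host):.1f} ms")
```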

If your application requires 100% uptime, you should take your analysis of locations one step further.

  • Are there any predictable events that could impact multiple locations?

For example, a single winter storm could impact both Chicago and New York. Disasters are good at finding your weak point. Plan ahead.

The good news is that with the Cloud model, backup capacity is cheap. You may have only a few servers—or none at all—running in the backup datacenter, with the ability to spin up more instances if a disaster occurs. As you work with providers, ask how they recommend configuring your disaster recovery. You may even want to consider using different providers for your primary and disaster recovery environments – reducing the risk that a change in one provider’s business direction impacts your services.
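
As a rough illustration of that “spin up more instances” step, here is a minimal sketch assuming AWS EC2 and boto3 purely as an example; the article doesn’t prescribe a provider, and the region, AMI ID, and instance type below are hypothetical placeholders.

```python
# Minimal sketch of activating cold-standby capacity in a backup region.
# Assumes AWS EC2 via boto3 purely as an illustration; the region, AMI ID,
# and instance type are hypothetical placeholders.
import boto3

def activate_dr_capacity(region="us-west-2", ami_id="ami-0123456789abcdef0",
                         instance_type="m5.large", count=4):
    """Launch replacement web servers in the disaster-recovery region."""
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Role", "Value": "dr-web"}],
        }],
    )
    return [instance["InstanceId"] for instance in response["Instances"]]
```

The point is less the specific API than the shape of the plan: the backup site costs almost nothing until the day you actually need it.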

Network

Now that you have a good idea of the providers and locations, dig deeper into the network connectivity of the facility.

  • Is the provider connected to multiple “Tier 1” Internet providers?
  • What steps does the provider take to ensure that there aren’t single points of failure in their network access?

Data and Monitoring

By now, you should have a good idea of the questions you need to ask to ensure your Cloud provider is reliable. But there’s one more step you might forget – and it goes back again to that all-important planning stage. The best datacenter redundancy plan will have absolutely no value if you don’t have a documented, regularly tested process for failovers.

  • Where is your data stored?
  • Is every bit of critical data still accessible if your primary datacenter goes down?
  • How long will it take to transition to the DR datacenter – and can you improve that time?

A prudent Web Operations team will test its DR process on a quarterly basis, preferably by performing a full failover to DR and back. At a minimum, your team should sit together and walk through the process, even if it’s not practical to do a live failover.
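
It also helps to script the individual failover steps, so that the quarterly test exercises the same code you would run in a real disaster. Below is one hedged example of such a step, assuming DNS-based failover with Amazon Route 53 via boto3; the hosted zone ID, record name, and DR address are hypothetical placeholders.

```python
# One example failover step: repoint the public DNS record at the
# disaster-recovery load balancer. Assumes Amazon Route 53 via boto3 as an
# illustration only; zone ID, record name, and address are placeholders.
import boto3

def point_dns_at_dr(zone_id="Z0EXAMPLE", record_name="www.example.com.",
                    dr_address="203.0.113.10", ttl=60):
    """Switch the public A record to the DR site's address."""
    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Comment": "Failover to disaster-recovery site",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": dr_address}],
                },
            }],
        },
    )
```

A low TTL on the record is what lets a switch like this take effect in minutes rather than hours – one concrete lever for the “can you improve that time” question above.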

Finally, don’t forget monitoring! How will you know if a critical service is offline? If you don’t find out about an outage until you arrive at work on Monday morning, all of your disaster recovery plans will be compromised.

  • Are all critical systems monitored?
  • Do the people getting the monitoring alert have a documented way to engage the disaster recovery process and communicate the status?
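
A commercial monitoring service will answer the first question for you, but even a bare-bones external check beats finding out from your users on Monday morning. The sketch below polls one critical URL and emails a human if it stops answering; the URL and addresses are hypothetical placeholders, and in practice you would run a check like this from outside your own datacenter and wire it into an on-call escalation.

```python
# Bare-bones availability check: poll a critical URL and email a human if
# it stops answering. The URL and addresses are hypothetical placeholders.
import smtplib
import urllib.request
from email.message import EmailMessage

CRITICAL_URL = "https://www.example.com/health"
ONCALL_EMAIL = "oncall@example.com"

def site_is_up(url, timeout=10):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False

def alert(message):
    """Send a plain-text alert email via a local mail relay."""
    msg = EmailMessage()
    msg["Subject"] = "ALERT: critical service check failed"
    msg["From"] = "monitor@example.com"
    msg["To"] = ONCALL_EMAIL
    msg.set_content(message)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    if not site_is_up(CRITICAL_URL):
        alert(f"{CRITICAL_URL} did not return HTTP 200 -- start the DR runbook.")
```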

The middle of an unfolding disaster is never a good time to realize you don’t have the phone numbers for your database team. Make sure you communicate contact information and processes in advance to all critical personnel.

The Cloud is perfect for companies looking to deploy scalable and reliable Websites and Web applications, but it doesn’t change the basics of good planning. Will 2012 be the year your company suffers a major outage for lack of redundancy and a good plan? Or will it be the year you can report to your customers that your operations remained unaffected while CNN reports massive outages?

  • Chris Weekly

    Great article!
    Along these lines (taking responsibility for availability in the cloud), I loved George Reese’s take on it too:

    http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html

    “… If you think this week exposed weakness in the cloud, you don’t get it: it was the cloud’s shining moment, exposing the strength of cloud computing.

    In short, if your systems failed in the Amazon cloud this week, it wasn’t Amazon’s fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon’s cloud computing model. The strength of cloud computing is that it puts control over application availability in the hands of the application developer and not in the hands of your IT staff, data center limitations, or a managed services provider.

    The AWS outage highlighted the fact that, in the cloud, you control your SLA in the cloud—not AWS.”

  • Mark S

    Excellent article. It may seem obvious now, but the same operational principles of fault tolerance apply to the cloud as well.

  • slawek22

    Well Jason, clouds and their promises are just snake oil.

    Just look at the promises Amazon and others make: redundancy, scalability, reliability. They’ll charge you the normal price of a machine x5 or even x10, and you get almost nothing in return.

    First you get some kind of NAS, which is no better for reliability than an ordinary RAID but painfully SLOOOW for writes.

    Reliability & Redundancy – it’s just much better to monitor the datacenter so bad things won’t happen. Let’s face it: if something bad happens, the application can’t be moved automatically. You’ll be waiting HOURS or DAYS for that to happen. If the server dies, someone needs to look at it; otherwise you could bring back corrupted data or db tables, and in 99% of cases that’s what will happen. You’ll be bringing your application back to (at best) an old or (at worst) a corrupted state. Someone has to look at it if you haven’t designed your app with this in mind (atomic writes, FKs, etc.). Even if you did, some admin needs to take a look to see what is lost and whether you can bring it back (maybe from some logs).

    The only good thing is that your app can be moved between nodes, so if one fails it should be moved to another one quickly (probably with some data corruption or loss in the process). So again, nothing really good about this.

    Cost cutting is also another myth, because there is no known tool that can analyze performance and move your app between nodes automatically. It will in fact require a very skilled admin to monitor resource usage in real time. Besides, are 10-20 second downtimes even worth saving a couple of bucks, when you can get 1GB of RAM for the price of a Big Mac?

    And now the biggest myth: scalability. Cloud applications can’t scale out of the box. It’s just impossible; no available technology can do this automatically. Actually, the best way to scale now is to buy better hardware and scale vertically, because no known solution can do it for you. Even if you use (very fashionable) NoSQL databases that provide vertical scaling, the number of network connections needed to fetch all the necessary pieces of data will kill you sooner or later. Your application will eventually be making millions of requests and keeping thousands of TCP connections open.

    @Chris:
    It’s just fanboyism. What you’re saying is that your application availability is not dependent on a managed services provider but on a cloud services provider, which is, in fact, completely the same thing.

    Even funnier is the fact that you say “it’s reliable because it broke”. I guess it’s also scalable because it can’t (like anything else) automatically scale, and it’s cheap because it’s expensive.

    Keeping availability up is a very hard task. Look, for example, at Amazon’s or Google’s cloud. At Amazon the availability is OK at best (you can get equally high availability from any mediocre dedicated provider); at Google it’s just appalling. There is no day when they have no problems with storage, database, backend or something else. When you put thousands of apps to run and compete for resources on a single system (like in Google’s real cloud), it will just never work reliably. It’s like hosting your solution on a shared server full of script kiddies. Any serious app needs isolated storage, CPU and RAM.