Our Experience with the AWS Outage

By now it is old news that Amazon had trouble in one of its data centers, causing downtime for many companies with high-profile web sites. Here at Spanning, we are big fans of the service that Amazon offers, and if anything, this recent outage has strengthened our conviction that Amazon is a strong partner.

Since our service is built as a cloud offering, we have gone to great lengths to take advantage of the scalability and fail-over capabilities that Amazon offers. None of this work gets any visible exposure, but it is what our customers expect. Rather than approach Amazon as a simple hosting service, we embraced the “expect things to fail” mantra.

In the rush to move to a cloud-hosted model, businesses need to consider what it is that they are trying to accomplish. Certainly it is easy to minimize capital costs by leveraging a cloud provider’s infrastructure. It is also possible to maximize the resilience of your applications by moving them to the cloud. But you can’t do both at the same time: fault tolerance doesn’t come cheap. You either need to build it into your application or invest in redundant infrastructure. We’ve built it into our app, and doing so buys us not only reliability but also scalability, flexibility, and, in the long run, efficiency.

Ultimately there is a business decision that needs to be made: how much are you willing to spend up front to prepare for the inevitable failure, versus how much downtime can your business afford? As you would expect, the less downtime you can absorb, the higher the up-front cost. And the closer you get to zero downtime, the faster that cost grows.
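To make the downtime side of that decision concrete, here is a minimal sketch that converts an availability target into the downtime it allows per year. The function name and the list of targets are illustrative assumptions, not figures from our own service, and the sketch says nothing about cost; it only shows how quickly the downtime budget shrinks as the target approaches 100%.

```python
# Illustrative only: the targets below are common "nines", not Spanning's numbers.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year that an availability target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_hours(target) * 60
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is one way to see why the last fraction of a percent tends to be the most expensive to reach.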

Here at Spanning we like to say that “backups are fine, but what really counts is the restore.” There is a lot of truth to this simple statement. It covers an enormous amount of planning:

  • What are you backing up? Have you really considered what information is critical to your business?
  • How will you get access to your backup? Do you need to call the IT department and have them sort through offline tapes, or is there a self-service portal that you can use?
  • When you restore, what else do you need to fix/update/reset?
  • Do you have a plan for failure, and is it written down on paper? Remember that when things go dark, you will likely not have access to your wiki.

The Amazon outage should be seen as a wake-up call to application developers and architects, not as a sign that cloud computing is unsafe. On the contrary, for all but the largest applications, there is no cost-effective alternative that matches the stability and scalability of a cloud services offering.